Abstract

Under certain conditions (known as the Restricted Isometry Property or RIP) on the $m \times N$ matrix $\Phi$ (where $m < N$), vectors $x \in \mathbb{R}^N$ that are sparse (i.e. have most of their entries equal to zero) can be recovered exactly from $y := \Phi x$ even though $\Phi^{-1}(y)$ is typically an $(N-m)$-dimensional hyperplane; in addition $x$ is then equal to the element in $\Phi^{-1}(y)$ of minimal $\ell_1$-norm. This minimal element can be identified via linear programming algorithms.

We study an alternative method of determining $x$, as the limit of an Iteratively Re-weighted Least Squares (IRLS) algorithm. The main step of this IRLS finds, for a given weight vector $w$, the element in $\Phi^{-1}(y)$ with smallest $\ell_2(w)$-norm. If $x^{(n)}$ is the solution at iteration step $n$, then the new weight $w^{(n)}$ is defined by $w_i^{(n)} := \big[|x_i^{(n)}|^2 + \epsilon_n^2\big]^{-1/2}$, $i = 1, \dots, N$, for a decreasing sequence of adaptively defined $\epsilon_n$; this updated weight is then used to obtain $x^{(n+1)}$ and the process is repeated. We prove that when $\Phi$ satisfies the RIP conditions, the sequence $x^{(n)}$ converges for all $y$, regardless of whether $\Phi^{-1}(y)$ contains a sparse vector. If there is a sparse vector in $\Phi^{-1}(y)$, then the limit is this sparse vector, and when $x^{(n)}$ is sufficiently close to the limit, the remaining steps of the algorithm converge exponentially fast (linear convergence in the terminology of numerical optimization). The same algorithm with the "heavier" weight $w_i^{(n)} = \big[|x_i^{(n)}|^2 + \epsilon_n^2\big]^{-1+\tau/2}$, $i = 1, \dots, N$, where $0 < \tau < 1$, can recover sparse solutions as well; more importantly, we show its local convergence is superlinear and approaches a quadratic rate as $\tau$ approaches zero.

1 Introduction 

Let $\Phi$ be an $m \times N$ matrix with $m < N$ and let $y \in \mathbb{R}^m$. (In the compressed sensing application that motivated this study, $\Phi$ typically has full rank, i.e. $\mathrm{Ran}(\Phi) = \mathbb{R}^m$. We shall implicitly assume, throughout the paper, that this is the case. Our results still hold for the case where $\mathrm{Ran}(\Phi) \subsetneq \mathbb{R}^m$, with the proviso that $y$ must then lie in $\mathrm{Ran}(\Phi)$.)
The linear system of equations

$$\Phi x = y \tag{1.1}$$

is underdetermined, and has infinitely many solutions. If $\mathcal{N} := \mathcal{N}(\Phi)$ is the null space of $\Phi$ and $x_0$ is any solution to (1.1), then the set $\mathcal{F}(y) := \Phi^{-1}(y)$ of all solutions to (1.1) is given by $\mathcal{F}(y) = x_0 + \mathcal{N}$.

In the absence of any other information, no solution to (1.1) is to be preferred over any other. However, many scientific applications work under the assumption that the desired solution $x \in \mathcal{F}(y)$ is either sparse or well approximated by (a) sparse vector(s). Here and later, we say a vector has sparsity $k$ (or is $k$-sparse) if it has at most $k$ nonzero coordinates. Suppose then that we know that the desired solution of (1.1) is $k$-sparse, where $k < m$ is known. How could we find such an $x$? One possibility is to consider any set $T$ of $k$ column indices and find the least squares solution $x^T := \mathrm{argmin}_{z \in \mathcal{F}(y)} \|\Phi_T z - y\|_{\ell_2^m}$, where $\Phi_T$ is obtained from $\Phi$ by setting to zero all entries that are not in columns from $T$. Finding $x^T$ is numerically simple (see (1.9)). After finding each $x^T$, we choose the particular set $T^*$ that minimizes the residual $\|\Phi_T z - y\|_{\ell_2^m}$. This would find a $k$-sparse solution (if it exists), $x^* = x^{T^*}$. However, this naive method is numerically prohibitive when $N$ and $k$ are large, since it requires solving $\binom{N}{k}$ least squares problems.
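The combinatorial cost of this naive search is easy to see in a small sketch. The code below is only an illustration under assumptions that are not in the text: it uses NumPy, reads "least squares on $T$" as fitting coefficients on the columns indexed by $T$, and the function name and test data are hypothetical.

```python
import itertools
import math
import numpy as np

def naive_sparse_solve(Phi, y, k):
    """Try every support T of size k and keep the one with the smallest residual."""
    N = Phi.shape[1]
    best = None
    for T in itertools.combinations(range(N), k):
        cols = list(T)
        # Least squares fit using only the columns indexed by T.
        coef, *_ = np.linalg.lstsq(Phi[:, cols], y, rcond=None)
        x = np.zeros(N)
        x[cols] = coef
        res = np.linalg.norm(Phi @ x - y)
        if best is None or res < best[0]:
            best = (res, T, x)
    return best

# Hypothetical data: 3 equations, 6 unknowns, a 2-sparse target vector.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((3, 6))
x_true = np.zeros(6)
x_true[1], x_true[4] = 1.5, -2.0
y = Phi @ x_true

res, T, x_hat = naive_sparse_solve(Phi, y, k=2)
n_problems = math.comb(6, 2)   # number of least squares problems solved
```

Even for this toy size the loop solves `n_problems` separate least squares problems, and the count grows like $\binom{N}{k}$.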

An attractive alternative to the naive minimization is its convex relaxation that consists in selecting the element in $\mathcal{F}(y)$ which has minimal $\ell_1$-norm:

$$x := \mathop{\mathrm{argmin}}_{z \in \mathcal{F}(y)} \|z\|_{\ell_1^N}. \tag{1.2}$$

Here and later we use the $\ell_p$-norms

$$\|x\|_{\ell_p} := \|x\|_{\ell_p^N} := \begin{cases} \big(\sum_{j=1}^N |x_j|^p\big)^{1/p}, & 0 < p < \infty, \\ \max_{j=1,\dots,N} |x_j|, & p = \infty. \end{cases} \tag{1.3}$$

Under certain assumptions on $\Phi$ and $y$ that we shall describe in §2, it is known that (1.2) has a unique solution (which we shall denote by $x^*$), and that, when there is a $k$-sparse solution to (1.1), (1.2) will find this solution [3, 7, 20, 21]. Because the problem (1.2) can be formulated as a linear program, it is numerically tractable.
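For concreteness, one standard way to write (1.2) as a linear program is sketched below. This is a textbook reformulation and not one spelled out in the text; the auxiliary vector $t$ is introduced here only for illustration and bounds the absolute values of the entries of $z$.

```latex
\begin{aligned}
\min_{z,\, t \in \mathbb{R}^N} \quad & \sum_{j=1}^N t_j \\
\text{subject to} \quad & \Phi z = y, \\
& -t_j \le z_j \le t_j, \qquad j = 1, \dots, N.
\end{aligned}
```

At any optimum $t_j = |z_j|$, so the optimal value equals the minimal $\ell_1$-norm over $\mathcal{F}(y)$.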

Solving underdetermined systems by $\ell_1$-minimization has a long history. It is at the heart of many numerical algorithms for approximation, compression, and statistical estimation. The use of the $\ell_1$-norm as a sparsity-promoting functional can be found first in reflection seismology and in deconvolution of seismic traces [16, 37, 38]. Rigorous results for $\ell_1$-minimization began to appear in the late 1980s, with Donoho and Stark [23] and Donoho and Logan [22]. Applications for $\ell_1$-minimization in statistical estimation began in the mid-1990s with the introduction of the LASSO and related formulations [39] (iterative soft-thresholding), also known as Basis Pursuit [15], proposed in compression applications for extracting the sparsest signal representation from highly overcomplete frames. Around the same time other signal processing groups started using $\ell_1$-minimization for the analysis of sparse signals; see, e.g. [32]. The applications and understanding of $\ell_1$-minimization saw a dramatic increase in the last 5 years [20], [23], IHl ISSl El HI, [3], [6], with the development of fairly general mathematical frameworks in which $\ell_1$-minimization, known heuristically to be sparsity-promoting, can be proved to recover sparse solutions exactly. We shall not trace all the relevant results and applications; a detailed history is beyond the scope of this introduction. We refer the reader to the survey papers OH]. The reader can also find a comprehensive collection of the ongoing recent developments at the web-site http://www.dsp.ece.rice.edu/cs/. In fact, $\ell_1$-minimization has been so surprisingly effective in several applications, that Candès, Wakin, and Boyd call it the "modern least squares" in [8]. We thus clearly need efficient algorithms for the minimization problem (1.2).

Several alternatives to (1.2), see, e.g., [Ml EI], have been proposed as possibly more efficient numerically, or simpler to implement by non-experts, than standard algorithms for linear programming (such as interior point or barrier methods). In this paper we clarify fine convergence properties of one such alternative method, called Iteratively Re-weighted Least Squares minimization (IRLS). It begins with the following observation (see §2 for details). If (1.2) has a solution $x^*$ that has no vanishing coordinates, then the (unique!) solution $x^w$ of the weighted least squares problem

$$x^w := \mathop{\mathrm{argmin}}_{z \in \mathcal{F}(y)} \|z\|_{\ell_2^N(w)}, \quad w := (w_1, \dots, w_N), \quad \text{where } w_j := |x_j^*|^{-1}, \tag{1.4}$$

coincides with $x^*$. (The following argument provides a short proof by contradiction of this statement. Assume that $x^*$ is not the $\ell_2^N(w)$-minimizer. Then there exists $\eta \in \mathcal{N}$ such that $\|x^* + \eta\|_{\ell_2^N(w)} < \|x^*\|_{\ell_2^N(w)}$, or equivalently $\frac{1}{2}\|\eta\|^2_{\ell_2^N(w)} < -\sum_{j=1}^N w_j \eta_j x_j^* = -\sum_{j=1}^N \eta_j\, \mathrm{sign}(x_j^*)$. However, because $x^*$ is an $\ell_1$-minimizer, we have $\|x^*\|_{\ell_1} \le \|x^* + h\eta\|_{\ell_1}$ for all $h \ne 0$; taking $h$ sufficiently small, this implies $\sum_{j=1}^N \eta_j\, \mathrm{sign}(x_j^*) = 0$, a contradiction.)

Since we do not know $x^*$, this observation cannot be used directly. However, it leads to the following paradigm for finding $x^*$. We choose a starting weight $w^0$ and solve (1.4) for this weight. We then use this solution to define a new weight $w^1$ and repeat this process. An IRLS algorithm of this type appears for the first time in the approximation practice in the Ph.D. thesis of Lawson in 1961 [30], in the form of an algorithm for solving uniform approximation problems, in particular by Chebyshev polynomials, by means of limits of weighted $\ell_p$-norm solutions. This iterative algorithm is now well-known in classical approximation theory as Lawson's algorithm. In [17] it is proved that this algorithm has in principle a linear convergence rate. In the 1970s extensions of Lawson's algorithm for $\ell_p$-minimization, and in particular $\ell_1$-minimization, were proposed. In signal analysis, IRLS was proposed as a technique to build algorithms for sparse signal reconstruction in [5S]. Perhaps the most comprehensive mathematical analysis of the performance of IRLS for $\ell_p$-minimization was given in the work of Osborne [33].

Osborne proves that a suitable IRLS method is convergent for $1 < p < 3$. For $p = 1$, if $w^n$ denotes the weight at the $n$th iteration and $x^n$ the minimal weighted least squares solution for this weight, then the algorithm considered by Osborne defines the new weight $w^{n+1}$ coordinatewise as $w_j^{n+1} := |x_j^n|^{-1}$. His main conclusion in this case is that if the $\ell_1$-minimization problem (1.2) has a unique solution, then the algorithm converges to this solution, in principle with linear convergence rate, i.e. exponentially fast, with a constant "contraction factor".

However, the analysis of Osborne does not take into consideration what happens if one of the coordinates vanishes at some iteration $n$, i.e. $x_j^n = 0$. Taking this to impose that the corresponding weight component $w_j^{n+1}$ must "equal" $\infty$ leads to $x_j^{n+1} = 0$ at the next iteration as well; this then persists in all later iterations. If $x_j^* = 0$, all is well, but if there is an index $j$ for which $x_j^* \ne 0$, yet $x_j^n = 0$ at some iteration step $n$, then this "infinite weight" prescription leads to problems. In practice, this is avoided by changing the definition of the weight at coordinates $j$ where $x_j^n = 0$ (see [31] and [10, 27] where a variant for total variation minimization is studied); such modified algorithms need no longer converge to $x^*$, however. Because Osborne's convergence proof is local, it implies that if the iterations begin with a vector sufficiently close to the solution, and if the solution is unique and has only nonzero entries, then none of the $x_j^n$ vanish, and the weight-change is not required; Osborne's analysis does indeed show the linear convergence rate of the algorithm under these assumptions. Unfortunately, as we will see in Remark 2.2, the uniqueness of the solution necessarily implies that it has vanishing components. In other words, the set of vectors to which Osborne's analysis applies is vacuous.

The purpose of the present paper is to put forward an IRLS algorithm that gives a re-weighting without infinite components in the weight, and to provide an analysis of this algorithm, with various results about its convergence and rate of convergence. It turns out that care must be taken in just how the new weight $w^{n+1}$ is derived from the solution $x^n$ of the current weighted least squares problem. To manage this difficulty, we shall consider a very specific recipe for generating the weights. Other recipes are certainly possible.

Given a real number $\epsilon > 0$ and a weight vector $w \in \mathbb{R}^N$, with $w_j > 0$, $j = 1, \dots, N$, we define

$$\mathcal{J}(z, w, \epsilon) := \frac{1}{2}\Big[\sum_{j=1}^N z_j^2 w_j + \sum_{j=1}^N \big(\epsilon^2 w_j + w_j^{-1}\big)\Big], \quad z \in \mathbb{R}^N. \tag{1.5}$$

Given $w$ and $\epsilon$, the element $z \in \mathbb{R}^N$ that minimizes $\mathcal{J}$ is unique because $\mathcal{J}$ is strictly convex.

Our algorithm will use an alternating method for choosing minimizers and weights based on the functional $\mathcal{J}$. To describe this, we define for $z \in \mathbb{R}^N$ the non-increasing rearrangement $r(z)$ of the absolute values of the entries of $z$. Thus $r(z)_i$ is the $i$-th largest element of the set $\{|z_j|,\ j = 1, \dots, N\}$, and a vector $v$ is $k$-sparse iff $r(v)_{k+1} = 0$.
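The rearrangement $r(z)$ and the sparsity test it gives can be written in a few lines. The following is a minimal sketch in plain Python; the function names are illustrative and not from the text, and note the shift between the 1-based index $k+1$ in the text and 0-based list indexing.

```python
def rearrangement(z):
    """Non-increasing rearrangement r(z) of the absolute values of the entries of z."""
    return sorted((abs(v) for v in z), reverse=True)

def is_k_sparse(v, k):
    """v is k-sparse iff r(v)_{k+1} = 0, i.e. v has at most k nonzero entries."""
    r = rearrangement(v)
    # r[k] is r(v)_{k+1} in the 1-based notation of the text.
    return k >= len(r) or r[k] == 0

z = [0.0, -3.0, 1.0, 0.0, 2.0]
r = rearrangement(z)
```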

Algorithm 1 We initialize by taking $w^0 := (1, \dots, 1)$. We also set $\epsilon_0 := 1$. We then recursively define for $n = 0, 1, \dots$,

$$x^{n+1} := \mathop{\mathrm{argmin}}_{z \in \mathcal{F}(y)} \mathcal{J}(z, w^n, \epsilon_n) = \mathop{\mathrm{argmin}}_{z \in \mathcal{F}(y)} \|z\|_{\ell_2(w^n)} \tag{1.6}$$

and

$$\epsilon_{n+1} := \min\Big(\epsilon_n, \frac{r(x^{n+1})_{K+1}}{N}\Big), \tag{1.7}$$

where $K$ is a fixed integer that will be described more fully later. We also define

$$w^{n+1} := \mathop{\mathrm{argmin}}_{w > 0} \mathcal{J}(x^{n+1}, w, \epsilon_{n+1}). \tag{1.8}$$

We stop the algorithm if $\epsilon_n = 0$; in this case we define $x^j := x^n$ for $j > n$. However, in general, the algorithm will generate an infinite sequence $(x^n)_{n \in \mathbb{N}}$ of distinct vectors. □

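Each step of the algorithm requires the solution of a least squares problem. In matrix form

$$x^{n+1} = D_n \Phi^t (\Phi D_n \Phi^t)^{-1} y, \tag{1.9}$$

where $D_n$ is the $N \times N$ diagonal matrix whose $j$-th diagonal entry is $1/w_j^n$ and $A^t$ denotes the transpose of the matrix $A$. (The inverse weights are what make (1.9) satisfy the orthogonality conditions (2.7) for the weight $w^n$.) Once $x^{n+1}$ is found, the weight $w^{n+1}$ is given by

$$w_j^{n+1} = \big[(x_j^{n+1})^2 + \epsilon_{n+1}^2\big]^{-1/2}, \quad j = 1, \dots, N. \tag{1.10}$$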
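Put together, the updates (1.7), (1.9) and (1.10) give the following sketch of Algorithm 1. It is an illustration under stated assumptions rather than a reference implementation: it uses NumPy, takes the diagonal of $D_n$ to be the inverse weights, caps the number of iterations, and replaces the exact stopping rule $\epsilon_n = 0$ by a small tolerance `tol` as a numerical safeguard. The function name and the $2 \times 3$ test system are hypothetical.

```python
import numpy as np

def irls(Phi, y, K, n_iter=200, tol=1e-9):
    """Sketch of Algorithm 1 (iteration cap and tolerance are added assumptions)."""
    N = Phi.shape[1]
    w = np.ones(N)          # w^0 := (1, ..., 1)
    eps = 1.0               # eps_0 := 1
    x = np.zeros(N)
    for _ in range(n_iter):
        d = 1.0 / w                                   # diagonal of D_n (inverse weights)
        lam = np.linalg.solve((Phi * d) @ Phi.T, y)   # (Phi D_n Phi^t)^{-1} y
        x = d * (Phi.T @ lam)                         # (1.9)
        r = np.sort(np.abs(x))[::-1]                  # non-increasing rearrangement r(x)
        eps = min(eps, r[K] / N)                      # (1.7); r[K] is r(x)_{K+1}
        if eps < tol:                                 # text stops at eps = 0; tol is a safeguard
            break
        w = 1.0 / np.sqrt(x**2 + eps**2)              # (1.10)
    return x, eps

# Hypothetical toy system: the null space is spanned by (1, 1, 1),
# and (1, 0, 0) is a 1-sparse solution of Phi x = y.
Phi = np.array([[1.0, -1.0, 0.0],
                [0.0, 1.0, -1.0]])
x_true = np.array([1.0, 0.0, 0.0])
y = Phi @ x_true

x_hat, eps_final = irls(Phi, y, K=1)
```

On this toy system the first iterate is the minimum $\ell_2$-norm solution $(2/3, -1/3, -1/3)$, and the later iterates move toward the sparse solution while $\epsilon_n$ decreases.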

We shall prove several results about the convergence and rate of convergence of this algorithm. This will be done under the following assumption on $\Phi$.

The Restricted Isometry Property (RIP): We say that the matrix $\Phi$ satisfies the Restricted Isometry Property of order $L$ with constant $\delta \in (0, 1)$ if for each vector $z$ with sparsity $L$ we have

$$(1 - \delta)\|z\|_{\ell_2^N} \le \|\Phi z\|_{\ell_2^m} \le (1 + \delta)\|z\|_{\ell_2^N}. \tag{1.11}$$

The RIP was introduced by Candès and Tao O S] in their study of compressed sensing and $\ell_1$-minimization. It has several analytical and geometrical interpretations that will be discussed in §3. To mention just one of these results (see [18]), it is known that if $\Phi$ has the RIP of order $L := J + J'$, with $\delta < \frac{\sqrt{J'} - \sqrt{J}}{\sqrt{J'} + \sqrt{J}}$ (here $J' > J$) and if (1.1) has a $J$-sparse solution $z \in \mathcal{F}(y)$, then this solution is the unique $\ell_1$-minimizer in $\mathcal{F}(y)$. (This can still be sharpened: in [9], Candès showed that if $\mathcal{F}(y)$ contains a $J$-sparse vector, and if $\Phi$ has RIP of order $2J$ with $\delta < \sqrt{2} - 1$, then that $J$-sparse vector is unique and is the unique $\ell_1$-minimizer in $\mathcal{F}(y)$.)
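For very small matrices the best constant $\delta$ in (1.11) can be found by brute force, since on each support of size $L$ the extreme values of $\|\Phi z\|/\|z\|$ are the extreme singular values of the corresponding column submatrix. The sketch below assumes NumPy; the function name and the example matrix are illustrative only, and the enumeration is exponential in $N$.

```python
import itertools
import numpy as np

def rip_constant(Phi, L):
    """Smallest delta satisfying (1.11) for all L-sparse z, by enumerating supports."""
    m, N = Phi.shape
    delta = 0.0
    for T in itertools.combinations(range(N), L):
        # Singular values of the column submatrix, in descending order.
        s = np.linalg.svd(Phi[:, list(T)], compute_uv=False)
        s_min = s[-1] if L <= m else 0.0   # more than m columns are linearly dependent
        delta = max(delta, s[0] - 1.0, 1.0 - s_min)
    return delta

# Hypothetical 2 x 3 example with unit-norm columns.
Phi = np.array([[1.0, 0.0, 0.6],
                [0.0, 1.0, 0.8]])
delta_1 = rip_constant(Phi, 1)
delta_2 = rip_constant(Phi, 2)
```

Here `delta_1` is zero up to rounding because every column has unit norm, while `delta_2` is driven by the most correlated pair of columns.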

The main result of this paper (Theorem 5.3) is that whenever $\Phi$ satisfies the RIP of order $K + K'$ (for some $K' > K$) with $\delta$ sufficiently close to zero, then Algorithm 1 converges to a solution $\bar{x}$ of (1.1) for each $y \in \mathbb{R}^m$. Moreover, if there is a solution $z$ to (1.1) that has sparsity $k \le K - \kappa$, then $\bar{x} = z$. Here $\kappa > 1$ depends on the RIP constant $\delta$ and can be made arbitrarily close to 1 when $\delta$ is made small. The result cited in our previous paragraph implies that in this case $\bar{x} = x^*$, where $x^*$ is the $\ell_1$-minimal solution to (1.1).

A second part of our analysis concerns rates of convergence. We shall show that if (1.1) has a $k$-sparse solution with, e.g., $k \le K - 4$ and if $\Phi$ satisfies the RIP of order $3K$ with $\delta$ sufficiently close to zero, then Algorithm 1 converges exponentially fast to $\bar{x} = x^*$. Namely, once $x^{n_0}$ is sufficiently close to its limit $\bar{x}$, we have

$$\|\bar{x} - x^{n+1}\|_{\ell_1^N} \le \mu \|\bar{x} - x^n\|_{\ell_1^N}, \quad n \ge n_0, \tag{1.12}$$

where $\mu < 1$ is a fixed constant (depending on $\delta$). From this result it follows that we have exponential convergence to $\bar{x}$ whenever $\bar{x}$ is $k$-sparse; however we have no real information on how long it will take before the iterates enter the region where we can control $\mu$. (Note that this is similar to convergence results for the interior point algorithms that can be used for direct $\ell_1$-minimization.)



The potential of IRLS algorithms, tailored to mimic $\ell_1$-minimization and so recover sparse solutions, has recently been investigated numerically by Chartrand and several co-authors [11, 12, 14]. Our work provides proofs of several findings listed in these works.

One of the virtues of our approach is that, with minor technical modifications, it allows a similar detailed analysis of IRLS algorithms with weights that promote the non-convex optimization of $\ell_\tau$-norms for $0 < \tau < 1$. We can show not only that these algorithms can again recover sparse solutions, but also that their local rate of convergence is superlinear and tends to quadratic when $\tau$ tends to zero. Thus we also justify theoretically the recent numerical results by Chartrand et al. concerning such non-convex $\ell_\tau$-norm optimization [11, 12, 13, 36].

An outline of our paper is the following. In the next section we make some remarks about $\ell_1$- and weighted $\ell_2$-minimization, upon which we shall call in our proof. In the following section, we recall the Restricted Isometry Property and the Null Space Property including some of its consequences that are important to our analysis. In section 4 we gather some preliminary results we shall need to prove our main convergence result, Theorem 5.3, which is formulated and proved in section 5. We then turn to the issue of rate of convergence in section 6. In section 7 we generalize the convergence results obtained for $\ell_1$-minimization to the case of $\ell_\tau$-spaces for $0 < \tau < 1$; in particular, we show, with Theorem 7.9, the local superlinear convergence of the IRLS algorithm in this setting. We conclude the paper with a short section dedicated to a few numerical examples that dovetail nicely with the theoretical results.

2 Characterization of $\ell_1$- and weighted $\ell_2$-minimizers

We fix $y \in \mathbb{R}^m$ and consider the underdetermined system $\Phi x = y$. Given a norm $\|\cdot\|$, the problem of minimizing $\|z\|$ over $z \in \mathcal{F}(y)$ can be viewed as a problem of approximation. Namely, for any $x_0 \in \mathcal{F}(y)$, we can characterize the minimizers in $\mathcal{F}(y)$ as exactly those elements $z \in \mathcal{F}(y)$ that can be written as $z = x_0 + \eta$, with $\eta$ a best approximation to $-x_0$ from $\mathcal{N}$. In this way one can characterize minimizers $z$ from classical results on best approximation in normed spaces. We consider two examples of this in the present section, corresponding to the $\ell_1$-norm and the weighted $\ell_2(w)$-norm.

Throughout this paper, we shall denote by $x$ any element from $\mathcal{F}(y)$ that has smallest $\ell_1$-norm, as in (1.2). When $x$ is unique, we shall emphasize this by denoting it by $x^*$. In general, $x$ and $x^*$ need not be sparse, although we will often consider cases where they are. We begin with the following well-known lemma (see for example Pinkus [34]) which characterizes the minimal $\ell_1$-norm elements from $\mathcal{F}(y)$.

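Lemma 2.1 An element $x \in \mathcal{F}(y)$ has minimal $\ell_1$-norm among all elements $z \in \mathcal{F}(y)$ if and only if

$$\Big|\sum_{x_i \ne 0} \mathrm{sign}(x_i)\,\eta_i\Big| \le \sum_{x_i = 0} |\eta_i|, \quad \eta \in \mathcal{N}. \tag{2.1}$$

Moreover, $x$ is unique if and only if we have strict inequality in (2.1) for all $\eta \in \mathcal{N}$ which are not identically zero.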

Proof: We give the simple proof for completeness of this paper. If $x \in \mathcal{F}(y)$ has minimum $\ell_1$-norm, then we have, for any $\eta \in \mathcal{N}$ and any $t \in \mathbb{R}$,

$$\sum_{i=1}^N |x_i + t\eta_i| \ge \sum_{i=1}^N |x_i|. \tag{2.2}$$

Fix $\eta \in \mathcal{N}$. If $t$ is sufficiently small then $x_i + t\eta_i$ and $x_i$ will have the same sign $s_i := \mathrm{sign}(x_i)$ whenever $x_i \ne 0$. Hence, (2.2) can be written as

$$t \sum_{x_i \ne 0} s_i \eta_i + \sum_{x_i = 0} |t\eta_i| \ge 0.$$

Choosing $t$ of an appropriate sign, we see that (2.1) is a necessary condition.

For the opposite direction, we note that if (2.1) holds then for each $\eta \in \mathcal{N}$, we have

$$\sum_{i=1}^N |x_i| = \sum_{x_i \ne 0} s_i x_i = \sum_{x_i \ne 0} s_i (x_i + \eta_i) - \sum_{x_i \ne 0} s_i \eta_i \le \sum_{x_i \ne 0} s_i (x_i + \eta_i) + \sum_{x_i = 0} |\eta_i| \le \sum_{i=1}^N |x_i + \eta_i|, \tag{2.3}$$

where the first inequality uses (2.1).

If $x$ is unique then we have strict inequality in (2.2) and hence subsequently in (2.1). If we have strict inequality in (2.1) then the subsequent strict inequality in (2.3) implies uniqueness. ■



Remark 2.2 Applying Lemma 2.1 to the special case of $\ell_1$-minimizers with no vanishing entries, we see that a vector $x \in \mathcal{F}(y)$, with $x_i \ne 0$ for all $i = 1, \dots, N$, is a minimal $\ell_1$-norm solution if and only if

$$\sum_{i=1}^N s_i \eta_i = 0, \quad \text{for all } \eta \in \mathcal{N}. \tag{2.4}$$

This implies that a minimal $\ell_1$-norm solution to $\Phi x = y$ for which all entries are non-vanishing is necessarily non-unique, by the following argument. Suppose that $x_i \ne 0$ for all $i = 1, \dots, N$ and that $x \in \mathcal{F}(y)$ is a minimal $\ell_1$-norm solution. Pick now any $\eta \in \mathcal{N}$, $\eta \ne 0$, and pick $t > 0$ so that $t < \min_{\eta_i \ne 0} |x_i|/|\eta_i|$; it then follows that $s_i = \mathrm{sign}(x_i + t\eta_i)$ for all $i = 1, \dots, N$. But then we have $\sum_{i=1}^N |x_i + t\eta_i| = \sum_{i=1}^N s_i (x_i + t\eta_i) = \sum_{i=1}^N |x_i|$ by (2.4), so that $x + t\eta$ is also a minimal solution, different from $x$. Hence, unique $\ell_1$-minimizers are necessarily $k$-sparse for some $k < N$. □
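The argument of Remark 2.2 can be traced on the smallest possible example. The sketch below uses a hypothetical $1 \times 2$ system ($\Phi = (1\ 1)$, $y = 2$), chosen here only for illustration: the solution $(1, 1)$ has no vanishing entries, satisfies (2.4), and can be shifted along the null space without changing its $\ell_1$-norm.

```python
def l1_norm(v):
    return sum(abs(c) for c in v)

# Hypothetical toy system: Phi = [1 1], y = 2, so F(y) = {(1 + t, 1 - t)}.
x = [1.0, 1.0]       # a minimal l1-norm solution with no vanishing entries
eta = [1.0, -1.0]    # spans the null space of Phi
s = [1.0 if c > 0 else -1.0 for c in x]

# Condition (2.4): sum_i s_i * eta_i = 0.
cond = sum(si * ei for si, ei in zip(s, eta))

# Any 0 < t < min |x_i| / |eta_i| keeps all signs, hence the l1-norm.
t = 0.5
x_shift = [xi + t * ei for xi, ei in zip(x, eta)]
```

Both `x` and `x_shift` solve the system with the same $\ell_1$-norm, so the minimizer is not unique.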

We next consider minimization in a weighted $\ell_2(w)$-norm. We suppose that the weight $w$ is strictly positive which we define to mean that $w_j > 0$ for all $j \in \{1, \dots, N\}$. In this case, $\ell_2(w)$ is a Hilbert space with the inner product

$$\langle u, v \rangle_w := \sum_{j=1}^N w_j u_j v_j. \tag{2.5}$$

We define

$$x^w := \mathop{\mathrm{argmin}}_{z \in \mathcal{F}(y)} \|z\|_{\ell_2^N(w)}. \tag{2.6}$$

Because the $\|\cdot\|_{\ell_2^N(w)}$-norm is strictly convex, the minimizer $x^w$ is necessarily unique; it is completely characterized by the orthogonality conditions

$$\langle x^w, \eta \rangle_w = 0, \quad \forall \eta \in \mathcal{N}. \tag{2.7}$$

Namely, $x^w$ necessarily satisfies (2.7); on the other hand, any element $z \in \mathcal{F}(y)$ that satisfies $\langle z, \eta \rangle_w = 0$ for all $\eta \in \mathcal{N}$ is automatically equal to $x^w$.
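The orthogonality conditions (2.7) are easy to check numerically. The sketch below assumes NumPy and the closed form $x^w = W^{-1}\Phi^t(\Phi W^{-1}\Phi^t)^{-1}y$ with $W = \mathrm{diag}(w)$, which is the standard solution of an equality-constrained weighted least squares problem; the helper names and the data are hypothetical.

```python
import numpy as np

def weighted_min_norm(Phi, y, w):
    """Minimizer of sum_j w_j z_j^2 subject to Phi z = y, for strictly positive w."""
    d = 1.0 / w
    lam = np.linalg.solve((Phi * d) @ Phi.T, y)
    return d * (Phi.T @ lam)

def null_space_basis(Phi, tol=1e-12):
    """Orthonormal basis of the null space of Phi (as columns), via the SVD."""
    _, s, Vt = np.linalg.svd(Phi)
    rank = int(np.sum(s > tol))
    return Vt[rank:].T

# Hypothetical data: 2 equations, 4 unknowns, strictly positive weights.
Phi = np.array([[1.0, 2.0, 0.0, 1.0],
                [0.0, 1.0, 1.0, -1.0]])
y = np.array([1.0, 2.0])
w = np.array([1.0, 2.0, 0.5, 4.0])

x_w = weighted_min_norm(Phi, y, w)
basis = null_space_basis(Phi)
inner = basis.T @ (w * x_w)   # <x_w, eta>_w for each basis vector eta of the null space
```

The vector `inner` vanishes up to rounding, which is (2.7) tested on a basis of $\mathcal{N}$.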

At this point, we would like to tabulate some of the notation we have used in this paper to denote various kinds of minimizers and other solutions alike (such as limits of algorithms).

  $z$            an (arbitrary) element of $\mathcal{F}(y)$
  $x$            any solution of $\min_{z \in \mathcal{F}(y)} \|z\|_{\ell_1}$
  $x^*$          unique solution of $\min_{z \in \mathcal{F}(y)} \|z\|_{\ell_1}$ (notation used only when the minimizer is unique)
  $x^w$          unique solution of $\min_{z \in \mathcal{F}(y)} \|z\|_{\ell_2(w)}$, $w_j > 0$ for all $j$
  $\bar{x}$      limit of Algorithm 1
  $x^\epsilon$   unique solution of $\min_{z \in \mathcal{F}(y)} f_\epsilon(z)$; see (5.5)

Table 1: Notation for solutions and minimizers.



3 The Restricted Isometry and the Null Space Properties

To analyze the convergence of our algorithm, we shall impose the Restricted Isometry Property (RIP) already mentioned in the introduction, or a slightly weaker version, the Null Space Property, which will be defined below. Recall that $\Phi$ satisfies RIP of order $L$ for $\delta \in (0, 1)$ (see (1.11)) iff

$$(1 - \delta)\|z\|_{\ell_2^N} \le \|\Phi z\|_{\ell_2^m} \le (1 + \delta)\|z\|_{\ell_2^N}, \quad \text{for all } L\text{-sparse } z. \tag{3.1}$$

It is known that many families of matrices satisfy the RIP. While there are deterministic families that are known to satisfy RIP, the largest range of $L$ (asymptotically, as $N \to \infty$, with e.g. $m/N$ kept constant) is obtained (to date) by using random families. For example, random families in which the entries of the matrix $\Phi$ are independent realizations of a (fixed) Gaussian or Bernoulli random variable are known to have the RIP with high probability for each $L \le c_0(\delta)\, n/\log n$ (see [7] m m [35] for a discussion of these results).

We shall say that $\Phi$ has the Null Space Property (NSP) of order $L$ for $\gamma > 0$ if$^1$

$$\|\eta_T\|_{\ell_1} \le \gamma \|\eta_{T^c}\|_{\ell_1}, \tag{3.2}$$

for all sets $T$ of cardinality not exceeding $L$ and all $\eta \in \mathcal{N}$. Here and later, we denote by $\eta_S$ the vector obtained from $\eta$ by setting to zero all coordinates $\eta_i$ for $i \notin S \subset \{1, 2, \dots, N\}$; $T^c$ denotes the complement of the set $T$. It is shown in Lemma 4.1 of [18] that if $\Phi$ has the RIP of order $L := J + J'$ for a given $\delta \in (0, 1)$, where $J, J' \ge 1$ are integers, then $\Phi$ has the NSP of order $J$ for $\gamma := \frac{1 + \delta}{1 - \delta}\sqrt{\frac{J}{J'}}$. Note that if $J'$ is sufficiently large then $\gamma < 1$.

Another result in [18] (see also Lemma 4.3 below) states that in order to guarantee that a $k$-sparse vector $x^*$ is the unique $\ell_1$-minimizer in $\mathcal{F}(y)$, it is sufficient that $\Phi$ has the NSP of order $L \ge k$ and $\gamma < 1$. (In fact, the argument in [4], proving that for $\Phi$ with the RIP, $\ell_1$-minimization identifies sparse vectors in $\mathcal{F}(y)$, can be split into two steps: one that implicitly derives the NSP from the RIP, and the remainder of the proof, which uses only the NSP.)

Note that if the NSP holds for some order $L_0$ and constant $\gamma_0$ (not necessarily $< 1$), then, by choosing $a > 0$ sufficiently small, one can ensure that $\Phi$ has the NSP of order $L = aL_0$ with constant $\gamma < 1$ (see [18] for details). So the effect of requiring that $\gamma < 1$ is tantamount to reducing the range of $L$ slightly.

When proving results on the convergence of our algorithm later in this paper, we shall state them under the assumptions that $\Phi$ has the NSP for some $\gamma < 1$ and an appropriate value of $L$. Using the observations above, they can easily be rephrased in terms of RIP bounds for $\Phi$.
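When the null space is one-dimensional, the best constant $\gamma$ in (3.2) can be read off directly: the ratio $\|\eta_T\|_{\ell_1}/\|\eta_{T^c}\|_{\ell_1}$ is scale invariant and is largest when $T$ collects the $L$ largest entries of $|\eta|$. The sketch below covers only this special case, which is an assumption made here for illustration; the function name is hypothetical.

```python
def nsp_constant_1d(eta, L):
    """Smallest gamma in (3.2) of order L when the null space is spanned by eta."""
    r = sorted((abs(c) for c in eta), reverse=True)
    head, tail = sum(r[:L]), sum(r[L:])   # l1 mass on the worst-case T and on its complement
    return head / tail if tail > 0 else float("inf")

gamma_flat = nsp_constant_1d([1, 1, 1], 1)       # null vector with evenly spread mass
gamma_peaky = nsp_constant_1d([3, 1, 1, 1], 1)   # null vector dominated by one entry
```

A null vector with evenly spread mass gives $\gamma = 1/2$, whereas one dominated by a single entry gives $\gamma = 1$, so the requirement $\gamma < 1$ fails for order 1.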

4 Preliminary results

We first make some comments about the decreasing rearrangement $r(z)$ and the $j$-term approximation errors for vectors in $\mathbb{R}^N$. Let us denote by $\Sigma_k$ the set of all $x \in \mathbb{R}^N$ such that $\#(\mathrm{supp}(x)) \le k$. For any $z \in \mathbb{R}^N$ and any $j = 1, 2, \dots, N$, we denote by

$$\sigma_j(z)_{\ell_1} := \inf_{w \in \Sigma_j} \|z - w\|_{\ell_1^N} \tag{4.1}$$

the $\ell_1$-error in approximating a general vector $z \in \mathbb{R}^N$ by a $j$-sparse vector. Note that these approximation errors can be written as a sum of entries of $r(z)$: $\sigma_j(z)_{\ell_1} = \sum_{\nu > j} r(z)_\nu$. We have the following lemma:

$^1$This definition of the Null Space Property is a slight variant of that given in [18] but is more convenient for the results in the present paper.



Lemma 4.1 The map $z \mapsto r(z)$ is Lipschitz continuous on $(\mathbb{R}^N, \|\cdot\|_{\ell_\infty})$: for any $z, z' \in \mathbb{R}^N$, we have

$$\|r(z) - r(z')\|_{\ell_\infty} \le \|z - z'\|_{\ell_\infty}. \tag{4.2}$$

Moreover, for any $j$, we have

$$|\sigma_j(z)_{\ell_1} - \sigma_j(z')_{\ell_1}| \le \|z - z'\|_{\ell_1}, \tag{4.3}$$

and for any $J > j$, we have

$$(J - j)\, r(z)_J \le \|z - z'\|_{\ell_1} + \sigma_j(z')_{\ell_1}. \tag{4.4}$$

Proof: For any pair of points $z$ and $z'$, and any $j \in \{1, \dots, N\}$, let $\Lambda$ be a set of $j - 1$ indices corresponding to the $j - 1$ largest entries in $z'$. Then

$$r(z)_j \le \max_{i \in \Lambda^c} |z_i| \le \max_{i \in \Lambda^c} |z'_i| + \|z - z'\|_{\ell_\infty} = r(z')_j + \|z - z'\|_{\ell_\infty}. \tag{4.5}$$

We can also reverse the roles of $z$ and $z'$. Therefore, we obtain (4.2). To prove (4.3), we approximate $z$ by a $j$-term best approximation $u \in \Sigma_j$ of $z'$ in $\ell_1$. Then

$$\sigma_j(z)_{\ell_1} \le \|z - u\|_{\ell_1} \le \|z - z'\|_{\ell_1} + \sigma_j(z')_{\ell_1},$$

and the result follows from symmetry.

To prove (4.4), it suffices to note that $(J - j)\, r(z)_J \le \sigma_j(z)_{\ell_1}$. ■
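The three inequalities of Lemma 4.1 can be spot-checked on random vectors. This is only a numerical sanity check in plain Python, not a proof; the helper names, the dimensions, and the small slack `1e-12` for rounding are choices made here.

```python
import random

def rearrangement(z):
    """Non-increasing rearrangement r(z) of the absolute values of the entries of z."""
    return sorted((abs(v) for v in z), reverse=True)

def sigma(z, j):
    """Best j-term approximation error in l1: sum of all but the j largest |z_i|."""
    return sum(rearrangement(z)[j:])

random.seed(0)
N, j, J = 8, 2, 5
ok = True
for _ in range(1000):
    z = [random.uniform(-1, 1) for _ in range(N)]
    zp = [random.uniform(-1, 1) for _ in range(N)]
    d_inf = max(abs(a - b) for a, b in zip(z, zp))
    d_one = sum(abs(a - b) for a, b in zip(z, zp))
    r, rp = rearrangement(z), rearrangement(zp)
    ok &= max(abs(a - b) for a, b in zip(r, rp)) <= d_inf + 1e-12   # (4.2)
    ok &= abs(sigma(z, j) - sigma(zp, j)) <= d_one + 1e-12          # (4.3)
    ok &= (J - j) * r[J - 1] <= d_one + sigma(zp, j) + 1e-12        # (4.4); r[J-1] is r(z)_J
```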

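Our next result is an approximate reverse triangle inequality for points in $\mathcal{F}(y)$. Its importance to us lies in its implication that whenever two points $z, z' \in \mathcal{F}(y)$ have close $\ell_1$-norms and one of them is close to a $k$-sparse vector, then they necessarily are close to each other. (Note that it also implies that the other vector must then also be close to that $k$-sparse vector.) This is a geometric property of the null space.

Lemma 4.2 Assume that (3.2) holds for some $L$ and $\gamma < 1$. Then, for any $z, z' \in \mathcal{F}(y)$, we have

$$\|z' - z\|_{\ell_1} \le \frac{1 + \gamma}{1 - \gamma}\big(\|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1}\big). \tag{4.6}$$

Proof: Let $T$ be a set of indices of the $L$ largest entries in $z$. Then

$$\begin{aligned}
\|(z' - z)_{T^c}\|_{\ell_1} &\le \|z'_{T^c}\|_{\ell_1} + \|z_{T^c}\|_{\ell_1} \\
&= \|z'\|_{\ell_1} - \|z'_T\|_{\ell_1} + \sigma_L(z)_{\ell_1} \\
&= \|z\|_{\ell_1} + \|z'\|_{\ell_1} - \|z\|_{\ell_1} - \|z'_T\|_{\ell_1} + \sigma_L(z)_{\ell_1} \\
&= \|z_T\|_{\ell_1} - \|z'_T\|_{\ell_1} + \|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1} \\
&\le \|(z' - z)_T\|_{\ell_1} + \|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1}.
\end{aligned} \tag{4.7}$$

Using (3.2), this gives

$$\|(z' - z)_T\|_{\ell_1} \le \gamma \|(z' - z)_{T^c}\|_{\ell_1} \le \gamma\big(\|(z' - z)_T\|_{\ell_1} + \|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1}\big). \tag{4.8}$$

In other words,

$$\|(z' - z)_T\|_{\ell_1} \le \frac{\gamma}{1 - \gamma}\big(\|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1}\big). \tag{4.9}$$

Using this, together with (4.7), we obtain

$$\|z' - z\|_{\ell_1} = \|(z' - z)_{T^c}\|_{\ell_1} + \|(z' - z)_T\|_{\ell_1} \le \frac{1 + \gamma}{1 - \gamma}\big(\|z'\|_{\ell_1} - \|z\|_{\ell_1} + 2\sigma_L(z)_{\ell_1}\big), \tag{4.10}$$

as desired. ■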



This result then allows the following simple proof of some of the results of [18]:

Lemma 4.3 Assume that (3.2) holds for some $L$ and $\gamma < 1$. Suppose that $\mathcal{F}(y)$ contains an $L$-sparse vector. Then this vector is the unique $\ell_1$-minimizer in $\mathcal{F}(y)$; denoting it by $x^*$, we have moreover, for all $v \in \mathcal{F}(y)$,

$$\|v - x^*\|_{\ell_1} \le 2\,\frac{1 + \gamma}{1 - \gamma}\,\sigma_L(v)_{\ell_1}. \tag{4.11}$$

Proof: For the time being, we denote the $L$-sparse vector in $\mathcal{F}(y)$ by $x_s$. Applying (4.6) with $z' = v$ and $z = x_s$, we find

$$\|v - x_s\|_{\ell_1} \le \frac{1 + \gamma}{1 - \gamma}\big[\|v\|_{\ell_1} - \|x_s\|_{\ell_1}\big];$$

since $v \in \mathcal{F}(y)$ is arbitrary, this implies that $\|v\|_{\ell_1} - \|x_s\|_{\ell_1} \ge 0$ for all $v \in \mathcal{F}(y)$, so that $x_s$ is an $\ell_1$-norm minimizer in $\mathcal{F}(y)$.

If $x'$ were another $\ell_1$-minimizer in $\mathcal{F}(y)$, then it would follow that $\|x'\|_{\ell_1} = \|x_s\|_{\ell_1}$, and the inequality we just derived would imply $\|x' - x_s\|_{\ell_1} = 0$, or $x' = x_s$. It follows that $x_s$ is the unique $\ell_1$-minimizer in $\mathcal{F}(y)$, which we denote by $x^*$, as proposed earlier.

Finally, we apply (4.6) with $z' = x^*$ and $z = v$, and we obtain

$$\|v - x^*\|_{\ell_1} \le \frac{1 + \gamma}{1 - \gamma}\big(\|x^*\|_{\ell_1} - \|v\|_{\ell_1} + 2\sigma_L(v)_{\ell_1}\big) \le 2\,\frac{1 + \gamma}{1 - \gamma}\,\sigma_L(v)_{\ell_1},$$

where we have used the $\ell_1$-minimization property of $x^*$. ■

Our next set of remarks centers around the functional $\mathcal{J}$ defined by (1.5). Note that for each $n = 1, 2, \dots$, we have

$$\mathcal{J}(x^{n+1}, w^{n+1}, \epsilon_{n+1}) = \sum_{j=1}^N \big[(x_j^{n+1})^2 + \epsilon_{n+1}^2\big]^{1/2}. \tag{4.12}$$

We also have the following monotonicity property which holds for all $n \ge 0$:

$$\mathcal{J}(x^{n+1}, w^{n+1}, \epsilon_{n+1}) \le \mathcal{J}(x^{n+1}, w^n, \epsilon_{n+1}) \le \mathcal{J}(x^{n+1}, w^n, \epsilon_n) \le \mathcal{J}(x^n, w^n, \epsilon_n). \tag{4.13}$$

Here the first inequality follows from the minimization property that defines $w^{n+1}$, the second inequality from $\epsilon_{n+1} \le \epsilon_n$, and the last inequality from the minimization property that defines $x^{n+1}$. For each $n$, $x^{n+1}$ is completely determined by $w^n$; for $n = 0$, in particular, $x^1$ is determined solely by $w^0$, and independent of the choice of $x^0 \in \mathcal{F}(y)$. (With the initial weight vector defined by $w^0 = (1, \dots, 1)$, $x^1$ is the classical minimum $\ell_2$-norm element of $\mathcal{F}(y)$.) The inequality (4.13) for $n = 0$ thus holds for arbitrary $x^0 \in \mathcal{F}(y)$.
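The identity (4.12) and the monotonicity (4.13) can be observed along a short run of the iteration. The sketch below assumes NumPy, takes the diagonal of $D_n$ in (1.9) to be the inverse weights, and uses a hypothetical $2 \times 3$ system chosen so that the first iterate can be checked by hand.

```python
import numpy as np

def J(z, w, eps):
    """Functional (1.5)."""
    return 0.5 * (np.sum(z**2 * w) + np.sum(eps**2 * w + 1.0 / w))

# Hypothetical toy system: null space spanned by (1, 1, 1); (1, 0, 0) solves Phi x = y.
Phi = np.array([[1.0, -1.0, 0.0],
                [0.0, 1.0, -1.0]])
y = np.array([1.0, 0.0])
K, N = 1, 3

w, eps = np.ones(N), 1.0
values = []
for n in range(20):
    d = 1.0 / w
    x = d * (Phi.T @ np.linalg.solve((Phi * d) @ Phi.T, y))   # x^{n+1}, as in (1.9)
    eps = min(eps, np.sort(np.abs(x))[::-1][K] / N)           # (1.7)
    w = 1.0 / np.sqrt(x**2 + eps**2)                          # (1.10)
    values.append(J(x, w, eps))                               # J(x^{n+1}, w^{n+1}, eps_{n+1})

closed_form = np.sum(np.sqrt(x**2 + eps**2))                  # right-hand side of (4.12)
```

The list `values` is non-increasing, and its last entry agrees with `closed_form` up to rounding.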

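Lemma 4.4 For each $n \ge 1$ we have

$$\|x^n\|_{\ell_1} \le \mathcal{J}(x^1, w^0, \epsilon_0) =: A \tag{4.14}$$

and

$$w_j^n \ge A^{-1}, \quad j = 1, \dots, N. \tag{4.15}$$

Proof: The bound (4.14) follows from (4.13) and

$$\|x^n\|_{\ell_1} \le \sum_{j=1}^N \big[(x_j^n)^2 + \epsilon_n^2\big]^{1/2} = \mathcal{J}(x^n, w^n, \epsilon_n).$$

The bound (4.15) follows from $(w_j^n)^{-1} = \big[(x_j^n)^2 + \epsilon_n^2\big]^{1/2} \le \mathcal{J}(x^n, w^n, \epsilon_n) \le A$, where the last inequality uses (4.13). ■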



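5 Convergence of the algorithm

In this section, we prove that the algorithm converges. Our starting point is the following lemma that establishes $(x^n - x^{n+1}) \to 0$ for $n \to \infty$.

Lemma 5.1 Given any $y \in \mathbb{R}^m$, the $x^n$ satisfy

$$\sum_{n=1}^\infty \|x^{n+1} - x^n\|_{\ell_2}^2 \le 2A^2, \tag{5.1}$$

where $A$ is the constant of Lemma 4.4. In particular, we have

$$\lim_{n \to \infty} (x^n - x^{n+1}) = 0. \tag{5.2}$$

Proof: For each $n = 1, 2, \dots$, we have

$$\begin{aligned}
2\big[\mathcal{J}(x^n, w^n, \epsilon_n) - \mathcal{J}(x^{n+1}, w^{n+1}, \epsilon_{n+1})\big] &\ge 2\big[\mathcal{J}(x^n, w^n, \epsilon_n) - \mathcal{J}(x^{n+1}, w^n, \epsilon_n)\big] \\
&= \langle x^n, x^n \rangle_{w^n} - \langle x^{n+1}, x^{n+1} \rangle_{w^n} \\
&= \langle x^n + x^{n+1}, x^n - x^{n+1} \rangle_{w^n} \\
&= \langle x^n - x^{n+1}, x^n - x^{n+1} \rangle_{w^n} \\
&= \sum_{j=1}^N w_j^n (x_j^n - x_j^{n+1})^2 \\
&\ge A^{-1} \|x^n - x^{n+1}\|_{\ell_2}^2,
\end{aligned} \tag{5.3}$$

where the third equality uses the fact that $\langle x^{n+1}, x^n - x^{n+1} \rangle_{w^n} = 0$ (observe that $x^{n+1} - x^n \in \mathcal{N}$ and invoke (2.7)), and the last inequality uses the bound (4.15) on the weights. If we now sum these inequalities over $n \ge 1$, the left-hand side telescopes and is bounded by $2\mathcal{J}(x^1, w^1, \epsilon_1) \le 2A$, and we arrive at (5.1). ■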



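From the monotonicity of $\epsilon_n$, we know that $\epsilon := \lim_{n \to \infty} \epsilon_n$ exists and is non-negative. The following functional will play an important role in our proof of convergence:

$$f_\epsilon(z) := \sum_{j=1}^N (z_j^2 + \epsilon^2)^{1/2}. \tag{5.4}$$

Notice that if we knew that $x^n$ converged to $x$ then, in view of (4.12), $f_\epsilon(x)$ would be the limit of $\mathcal{J}(x^n, w^n, \epsilon_n)$. When $\epsilon > 0$ the functional $f_\epsilon$ is strictly convex and therefore has a unique minimizer

$$x^\epsilon := \mathop{\mathrm{argmin}}_{z \in \mathcal{F}(y)} f_\epsilon(z). \tag{5.5}$$

This minimizer is characterized by the following lemma:

Lemma 5.2 Let $\epsilon > 0$ and $z \in \mathcal{F}(y)$. Then $z = x^\epsilon$ if and only if $\langle z, \eta \rangle_{\widetilde{w}(z, \epsilon)} = 0$ for all $\eta \in \mathcal{N}$, where $\widetilde{w}(z, \epsilon)_i = [z_i^2 + \epsilon^2]^{-1/2}$.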

Proof: For the "only if" part, let $z = x^\epsilon$ and $\eta \in \mathcal{N}$ be arbitrary. Consider the analytic function

$$G_\epsilon(t) := f_\epsilon(z + t\eta) - f_\epsilon(z).$$

We have $G_\epsilon(0) = 0$, and by the minimization property $G_\epsilon(t) \ge 0$ for all $t \in \mathbb{R}$. Hence, $G'_\epsilon(0) = 0$. A simple calculation reveals that

$$G'_\epsilon(0) = \sum_{i=1}^N \frac{\eta_i z_i}{[z_i^2 + \epsilon^2]^{1/2}} = \langle z, \eta \rangle_{\widetilde{w}(z, \epsilon)},$$

which gives the desired result.

For the "if" part, assume that $z \in \mathcal{F}(y)$ and $\langle z, \eta \rangle_{\widetilde{w}(z, \epsilon)} = 0$ for all $\eta \in \mathcal{N}$, where $\widetilde{w}(z, \epsilon)$ is defined as above. We shall show that $z$ is a minimizer of $f_\epsilon$ on $\mathcal{F}(y)$. Indeed, consider the convex univariate function $[u^2 + \epsilon^2]^{1/2}$. For any point $u_0$ we have from convexity that

$$[u^2 + \epsilon^2]^{1/2} \ge [u_0^2 + \epsilon^2]^{1/2} + [u_0^2 + \epsilon^2]^{-1/2} u_0 (u - u_0), \tag{5.6}$$

because the right side is the linear function which is tangent to this function at $u_0$. It follows that for any point $v \in \mathcal{F}(y)$ we have

$$f_\epsilon(v) \ge f_\epsilon(z) + \sum_{j=1}^N [z_j^2 + \epsilon^2]^{-1/2} z_j (v_j - z_j) = f_\epsilon(z) + \langle z, v - z \rangle_{\widetilde{w}(z, \epsilon)} = f_\epsilon(z), \tag{5.7}$$

where we have used the orthogonality condition (5.13) and the fact that $v - z$ is in $\mathcal{N}$. Since $v$ is arbitrary, it follows that $z = x^\epsilon$, as claimed. ■
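The "simple calculation" in the proof, namely that the derivative of $t \mapsto f_\epsilon(z + t\eta)$ at $t = 0$ equals the weighted inner product $\langle z, \eta \rangle_{\widetilde{w}(z,\epsilon)}$, can be checked against a finite difference. The sketch below is plain Python; the vectors, the value of $\epsilon$, and the step size `h` are arbitrary illustrative choices.

```python
import math

def f_eps(z, eps):
    """Functional (5.4)."""
    return sum(math.sqrt(c * c + eps * eps) for c in z)

def weighted_inner(z, eta, eps):
    """<z, eta> with respect to the weight (z_i^2 + eps^2)^(-1/2)."""
    return sum(zi * ei / math.sqrt(zi * zi + eps * eps) for zi, ei in zip(z, eta))

z = [1.0, -2.0, 0.5]
eta = [0.3, 0.1, -0.7]
eps, h = 0.1, 1e-6

# Central finite difference of t -> f_eps(z + t * eta) at t = 0.
plus = f_eps([zi + h * ei for zi, ei in zip(z, eta)], eps)
minus = f_eps([zi - h * ei for zi, ei in zip(z, eta)], eps)
fd = (plus - minus) / (2 * h)
exact = weighted_inner(z, eta, eps)
```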

We now give the convergence of the algorithm. 



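Theorem 5.3 Let $K$ (the same index as used in the update rule (1.7)) be chosen so that $\Phi$ satisfies
the Null Space Property (3.2) of order $K$, with $\gamma < 1$. Then, for each $y \in \mathbb{R}^m$, the output of
Algorithm 1 converges to a vector $\bar x$, with $r(\bar x)_{K+1} = N \lim_{n\to\infty} \epsilon_n$, and the following hold:

(i) If $\epsilon = \lim_{n\to\infty} \epsilon_n = 0$, then $\bar x$ is $K$-sparse; in this case there is therefore a unique $\ell_1$-minimizer
$x^*$, and $\bar x = x^*$; moreover, we have, for $k \le K$ and any $z \in \mathcal{F}(y)$,

$$\|z - \bar x\|_{\ell_1} \le c\,\sigma_k(z)_{\ell_1}, \quad \text{with } c := \frac{2(1+\gamma)}{1-\gamma}. \tag{5.8}$$

(ii) If $\epsilon = \lim_{n\to\infty} \epsilon_n > 0$, then $\bar x = x^\epsilon$;

(iii) In this last case, if $\gamma$ satisfies the stricter bound $\gamma < 1 - \frac{2}{K+2}$ (or, equivalently, if $\frac{2\gamma}{1-\gamma} < K$),
then we have, for all $z \in \mathcal{F}(y)$ and any $k < K - \frac{2\gamma}{1-\gamma}$, that

$$\|z - \bar x\|_{\ell_1} \le \tilde c\,\sigma_k(z)_{\ell_1}, \quad \text{with } \tilde c := \frac{2(1+\gamma)}{1-\gamma}\left[\frac{K - k + \frac{3}{2}}{K - k - \frac{2\gamma}{1-\gamma}}\right]. \tag{5.9}$$

As a consequence, this case is excluded if $\mathcal{F}(y)$ contains a vector of sparsity $k < K - \frac{2\gamma}{1-\gamma}$.

The constant $\tilde c$ can be quite reasonable; for instance, if $\gamma \le 1/2$ and $k \le K - 3$, then we have

$$\tilde c \le 9\,\frac{1+\gamma}{1-\gamma} \le 27.$$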
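As an illustration of how the constant in (5.9) depends on $\gamma$ and on the gap $K - k$, here is a small sketch that evaluates it directly. The function name `c_tilde` and the sample values are assumptions introduced only for this example.

```python
def c_tilde(gamma, K, k):
    """Evaluate the constant of (5.9):
    2(1+gamma)/(1-gamma) * (K-k+3/2) / (K-k-2*gamma/(1-gamma)).
    Requires 0 <= gamma < 1 and k < K - 2*gamma/(1-gamma)."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    gap = 2.0 * gamma / (1.0 - gamma)
    if not k < K - gap:
        raise ValueError("need k < K - 2*gamma/(1-gamma)")
    return 2.0 * (1.0 + gamma) / (1.0 - gamma) * (K - k + 1.5) / (K - k - gap)
```

With $\gamma = 1/2$ and $K - k = 3$ the value is exactly the bound 27 quoted above; smaller $\gamma$ or a larger gap $K - k$ gives a smaller constant.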

Proof: Note that since $\epsilon_{n+1} \le \epsilon_n$, the $\epsilon_n$ always converge. We start by considering the case
$\epsilon := \lim_{n\to\infty} \epsilon_n = 0$.

Case $\epsilon = 0$: In this case, we want to prove that $x^n$ converges, and that its limit is an $\ell_1$-minimizer.
Suppose that $\epsilon_{n_0} = 0$ for some $n_0$. Then by the definition of the algorithm, we know
that the iteration is stopped at $n = n_0$, and $x^n = x^{n_0}$ for $n \ge n_0$. Therefore $\bar x = x^{n_0}$. From the
definition of $\epsilon_n$, it then also follows that $r(x^{n_0})_{K+1} = 0$ and so $\bar x = x^{n_0}$ is $K$-sparse. As noted
earlier and in Lemma 4.3, if a $K$-sparse solution exists when $\Phi$ satisfies the NSP of order $K$ with $\gamma < 1$,
then it is the unique $\ell_1$-minimizer. Therefore, $\bar x$ equals $x^*$, this unique minimizer.

Suppose now that $\epsilon_n > 0$ for all $n$. Since $\epsilon_n \to 0$, there is an increasing sequence of indices $(n_i)$
such that $\epsilon_{n_i} < \epsilon_{n_i - 1}$ for all $i$. By the definition (1.7) of $(\epsilon_n)_{n\in\mathbb{N}}$, we must have $r(x^{n_i})_{K+1} < N\epsilon_{n_i - 1}$
for all $i$. Noting that $(x^n)_{n\in\mathbb{N}}$ is a bounded sequence, there exists a subsequence $(p_j)_{j\in\mathbb{N}}$ of $(n_i)_{i\in\mathbb{N}}$
such that $(x^{p_j})_{j\in\mathbb{N}}$ converges to a point $\bar x \in \mathcal{F}(y)$. By Lemma 4.1, we know that $r(x^{p_j})_{K+1}$ converges
to $r(\bar x)_{K+1}$. Hence we get

$$r(\bar x)_{K+1} = \lim_{j\to\infty} r(x^{p_j})_{K+1} \le \lim_{j\to\infty} N\epsilon_{p_j - 1} = 0, \tag{5.10}$$

which means that the support-width of $\bar x$ is at most $K$, i.e. $\bar x$ is $K$-sparse. By the same argument as
above, we again have that $\bar x = x^*$, the unique $\ell_1$-minimizer. We must still show that $x^n \to x^*$.
Since $x^{p_j} \to x^*$ and $\epsilon_{p_j} \to 0$, (4.12) implies $J(x^{p_j}, w^{p_j}, \epsilon_{p_j}) \to \|x^*\|_{\ell_1}$. By the monotonicity property
stated in (4.13), we get $J(x^n, w^n, \epsilon_n) \to \|x^*\|_{\ell_1}$. Since (4.12) implies

$$J(x^n, w^n, \epsilon_n) - N\epsilon_n \le \|x^n\|_{\ell_1} \le J(x^n, w^n, \epsilon_n), \tag{5.11}$$

we obtain $\|x^n\|_{\ell_1} \to \|x^*\|_{\ell_1}$. Finally, we invoke Lemma 4.2 with $z' = x^n$, $z = x^*$, and $k = K$ to get

$$\limsup_{n\to\infty} \|x^n - x^*\|_{\ell_1} \le \frac{1+\gamma}{1-\gamma}\left(\lim_{n\to\infty} \|x^n\|_{\ell_1} - \|x^*\|_{\ell_1}\right) = 0, \tag{5.12}$$

which completes the proof that $x^n \to x^*$ in this case.

Finally, (5.8) follows from (4.11) of Lemma 4.3 (with $L = K$), and the observation that $\sigma_n(z) \ge \sigma_{n'}(z)$
if $n \le n'$.

Case $\epsilon > 0$: We shall first show that $x^n \to x^\epsilon$ as $n \to \infty$, with $x^\epsilon$ as defined by (5.5). By Lemma
4.4 we know that $(x^n)_{n\in\mathbb{N}}$ is a bounded sequence in $\mathbb{R}^N$ and hence this sequence has accumulation
points. Let $(x^{n_i})$ be any convergent subsequence of $(x^n)$ and let $\bar x \in \mathcal{F}(y)$ be its limit. We want to
show that $\bar x = x^\epsilon$.

Since $w_j^n = [(x_j^n)^2 + \epsilon_n^2]^{-1/2} \le \epsilon^{-1}$, it follows that $\lim_{i\to\infty} w_j^{n_i} = [(\bar x_j)^2 + \epsilon^2]^{-1/2} = \tilde w(\bar x, \epsilon)_j
=: \tilde w_j$, $j = 1, \ldots, N$. On the other hand, by invoking Lemma 5.1, we now find that $x^{n_i+1} \to \bar x$ as
$i \to \infty$. It then follows from the orthogonality relations (2.7) that for every $\eta \in \mathcal{N}$, we have

$$\langle \bar x, \eta\rangle_{\tilde w} = \lim_{i\to\infty} \langle x^{n_i+1}, \eta\rangle_{w^{n_i}} = 0. \tag{5.13}$$

Now the "if" part of Lemma 5.2 implies that $\bar x = x^\epsilon$. Hence $x^\epsilon$ is the unique accumulation point
of $(x^n)_{n\in\mathbb{N}}$ and therefore its limit. This establishes (ii).

To prove the error estimate (5.9) stated in (iii), we first note that for any $z \in \mathcal{F}(y)$, we have

$$\|x^\epsilon\|_{\ell_1} \le f_\epsilon(x^\epsilon) \le f_\epsilon(z) \le \|z\|_{\ell_1} + N\epsilon, \tag{5.14}$$

where the second inequality uses the minimizing property of $x^\epsilon$. Hence it follows that
$\|x^\epsilon\|_{\ell_1} - \|z\|_{\ell_1} \le N\epsilon$. We now invoke Lemma 4.2 to obtain

$$\|x^\epsilon - z\|_{\ell_1} \le \frac{1+\gamma}{1-\gamma}\left[N\epsilon + 2\sigma_k(z)_{\ell_1}\right]. \tag{5.15}$$

From Lemma 4.1 and (1.7), we obtain

$$N\epsilon = \lim_{n\to\infty} N\epsilon_n \le \lim_{n\to\infty} r(x^n)_{K+1} = r(x^\epsilon)_{K+1}. \tag{5.16}$$

It follows from (4.4) that

$$(K+1-k)N\epsilon \le (K+1-k)\,r(x^\epsilon)_{K+1} \le \|x^\epsilon - z\|_{\ell_1} + \sigma_k(z)_{\ell_1} \le \frac{1+\gamma}{1-\gamma}\left[N\epsilon + 2\sigma_k(z)_{\ell_1}\right] + \sigma_k(z)_{\ell_1}, \tag{5.17}$$

where the last inequality uses (5.15). Since by assumption on $K$ we have $K - k > \frac{2\gamma}{1-\gamma}$, i.e.
$K + 1 - k > \frac{1+\gamma}{1-\gamma}$, we obtain

$$N\epsilon + 2\sigma_k(z)_{\ell_1} \le \frac{2(K-k) + 3}{(K-k) - \frac{2\gamma}{1-\gamma}}\,\sigma_k(z)_{\ell_1}.$$

Using this back in (5.15), we arrive at (5.9).

Finally, notice that if $\mathcal{F}(y)$ contains a $k$-sparse vector (with $k < K - \frac{2\gamma}{1-\gamma}$), then we know already
(see above) that this must be the unique $\ell_1$-minimizer $x^*$; it then follows from our arguments above
that we must have $\epsilon = 0$. Indeed, if we had $\epsilon > 0$, then (5.17) would hold for $z = x^*$; since $x^*$ is
$k$-sparse, $\sigma_k(x^*)_{\ell_1} = 0$, implying $\epsilon = 0$, a contradiction with the assumption $\epsilon > 0$. This finishes
the proof. ■



Remark 5.4 Let us briefly compare our analysis of the IRLS algorithm with $\ell_1$-minimization. The
latter recovers a $k$-sparse solution (when one exists) if $\Phi$ has the NSP of order $K$ and $k \le K$. The
analysis given in our proof of Theorem 5.3 guarantees that our IRLS algorithm recovers $k$-sparse $x$
for a slightly smaller range of values $k$ than $\ell_1$-minimization, namely for $k < K - \frac{2\gamma}{1-\gamma}$. Notice that
this "gap" vanishes for vanishingly small $\gamma$. Although we have no examples to demonstrate it, our
arguments cannot exclude the case where $\mathcal{F}(y)$ contains a $k$-sparse vector $x^*$ with $K - \frac{2\gamma}{1-\gamma} \le k \le K$
(e.g., if $\gamma \ge 1/3$ and $k = K - 1$), and our IRLS algorithm converges to $\bar x$, yet $\bar x \ne x^*$. However,
note that unless $\gamma$ is close to 1, the range of $k$-values in this "gap" is fairly small; for instance, for
$\gamma < 1/3$, this non-recovery of a $k$-sparse $x^*$ can happen only if $k = K$. □

Remark 5.5 The constant $c$ in (5.8) is clearly smaller than the constant $\tilde c$ in (5.9); it follows that
when $k < K - \frac{2\gamma}{1-\gamma}$, the estimate (5.9) holds for all cases, regardless of whether $\epsilon = 0$ or not. □



6 Rate of Convergence 

Under the conditions of Theorem 5.3 the algorithm converges to a limit $\bar x$; if there is a $k$-sparse
vector in $\mathcal{F}(y)$ with $k < K - \frac{2\gamma}{1-\gamma}$, then this limit coincides with that $k$-sparse vector, which is then
also automatically the unique $\ell_1$-minimizer $x^*$. In this section our goal is to establish a bound for
the rate of convergence in both the sparse and non-sparse cases. In the latter case, the goal is to
establish the rate at which $x^n$ approaches a ball of radius $C_1\sigma_k(x^*)_{\ell_1}$ centered at $x^*$. We shall
work under the same assumptions as in Theorem 5.3.

6.1 Case of k-sparse vectors

Let us begin by assuming that $\mathcal{F}(y)$ contains the $k$-sparse vector $x^*$. The algorithm produces the
sequence $x^n$, which converges to $x^*$, as established above. Let us denote the (unknown) support of
the $k$-sparse vector $x^*$ by $T$.

We introduce an auxiliary sequence of error vectors $\eta^n \in \mathcal{N}$ via $\eta^n := x^n - x^*$ and

$$E_n := \|\eta^n\|_{\ell_1} = \|x^* - x^n\|_{\ell_1}.$$

We know that $E_n \to 0$. The following theorem gives a bound on the rate of convergence of $E_n$ to
zero.

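Theorem 6.1 Assume $\Phi$ satisfies the NSP of order $K$ with constant $\gamma$ such that $0 < \gamma < 1 - \frac{2}{K+2}$.
Suppose that $k < K - \frac{2\gamma}{1-\gamma}$, $0 < \rho < 1$, and $0 < \gamma < 1 - \frac{2}{K+2}$ are such that

$$\mu := \frac{\gamma(1+\gamma)}{1-\rho}\left(1 + \frac{1}{K+1-k}\right) < 1.$$

Assume that $\mathcal{F}(y)$ contains a $k$-sparse vector $x^*$ and let $T = \operatorname{supp}(x^*)$. Let $n_0$ be such that

$$E_{n_0} \le R^* := \rho \min_{i\in T} |x_i^*|. \tag{6.1}$$

Then for all $n \ge n_0$, we have

$$E_{n+1} \le \mu E_n.$$

Consequently $x^n$ converges to $x^*$ exponentially.

Remark 6.2 Notice that if $\gamma$ is sufficiently small, e.g. $\gamma(1+\gamma) < \frac{2}{3}$, then for any $k < K$, there is
a $\rho > 0$ for which $\mu < 1$, so we have exponential convergence to $x^*$ whenever $x^*$ is $k$-sparse. □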

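Proof: We start with the relation (2.7) with $w = w^n$, $x^w = x^{n+1} = x^* + \eta^{n+1}$, and
$\eta = x^{n+1} - x^* = \eta^{n+1}$, which gives

$$\sum_{i=1}^N (x_i^* + \eta_i^{n+1})\,\eta_i^{n+1} w_i^n = 0.$$

Rearranging the terms and using the fact that $x^*$ is supported on $T$, we get

$$\sum_{i=1}^N |\eta_i^{n+1}|^2 w_i^n = -\sum_{i\in T} x_i^* \eta_i^{n+1} w_i^n = -\sum_{i\in T} \frac{x_i^*}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}}\,\eta_i^{n+1}. \tag{6.2}$$

We will prove the theorem by induction. Let us assume that we have shown $E_n \le R^*$ already.
We then have, for all $i \in T$,

$$|\eta_i^n| \le \|\eta^n\|_{\ell_1} = E_n \le \rho |x_i^*|,$$

so that

$$\frac{|x_i^*|}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}} \le \frac{|x_i^*|}{|x_i^n|} = \frac{|x_i^*|}{|x_i^* + \eta_i^n|} \le \frac{1}{1-\rho}, \tag{6.3}$$

and hence (6.2) combined with (6.3) and the NSP gives

$$\sum_{i=1}^N |\eta_i^{n+1}|^2 w_i^n \le \frac{1}{1-\rho}\|\eta_T^{n+1}\|_{\ell_1} \le \frac{\gamma}{1-\rho}\|\eta_{T^c}^{n+1}\|_{\ell_1}.$$

At the same time, the Cauchy-Schwarz inequality combined with the above estimate yields

$$\|\eta_{T^c}^{n+1}\|_{\ell_1}^2 \le \left(\sum_{i\in T^c} |\eta_i^{n+1}|^2 w_i^n\right)\left(\sum_{i\in T^c} [(x_i^n)^2 + \epsilon_n^2]^{1/2}\right) \le \left(\sum_{i=1}^N |\eta_i^{n+1}|^2 w_i^n\right)\left(\sum_{i\in T^c} [(\eta_i^n)^2 + \epsilon_n^2]^{1/2}\right) \le \frac{\gamma}{1-\rho}\|\eta_{T^c}^{n+1}\|_{\ell_1}\left(\|\eta^n\|_{\ell_1} + N\epsilon_n\right). \tag{6.4}$$

If $\eta_{T^c}^{n+1} = 0$, then $x_{T^c}^{n+1} = 0$. In this case $x^{n+1}$ is $k$-sparse and the algorithm has stopped by
definition; since $x^{n+1} - x^*$ is in the null space $\mathcal{N}$, which contains no $k$-sparse elements other than
0, we have already obtained the solution $x^{n+1} = x^*$. If $\eta_{T^c}^{n+1} \ne 0$, then after canceling the factor
$\|\eta_{T^c}^{n+1}\|_{\ell_1}$ in (6.4), we get

$$\|\eta_{T^c}^{n+1}\|_{\ell_1} \le \frac{\gamma}{1-\rho}\left(\|\eta^n\|_{\ell_1} + N\epsilon_n\right),$$

and thus

$$\|\eta^{n+1}\|_{\ell_1} = \|\eta_T^{n+1}\|_{\ell_1} + \|\eta_{T^c}^{n+1}\|_{\ell_1} \le (1+\gamma)\|\eta_{T^c}^{n+1}\|_{\ell_1} \le \frac{\gamma(1+\gamma)}{1-\rho}\left(\|\eta^n\|_{\ell_1} + N\epsilon_n\right). \tag{6.5}$$

Now, we also have by (1.7) and (4.4)

$$N\epsilon_n \le r(x^n)_{K+1} \le \frac{1}{K+1-k}\left(\|x^n - x^*\|_{\ell_1} + \sigma_k(x^*)_{\ell_1}\right) = \frac{\|\eta^n\|_{\ell_1}}{K+1-k}, \tag{6.6}$$

since by assumption $\sigma_k(x^*)_{\ell_1} = 0$. This, together with (6.5), yields the desired bound,

$$E_{n+1} = \|\eta^{n+1}\|_{\ell_1} \le \frac{\gamma(1+\gamma)}{1-\rho}\left(1 + \frac{1}{K+1-k}\right)\|\eta^n\|_{\ell_1} = \mu E_n.$$

In particular, since $\mu < 1$, we have $E_{n+1} \le R^*$, which completes the induction step. It follows that
$E_{n+1} \le \mu E_n$ for all $n \ge n_0$. ■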



Remark 6.3 Note that the precise update rule (1.7) for $\epsilon_n$ does not really intervene in this analysis.
If $E_{n_0} \le R^*$, then the estimate

$$E_{n+1} \le \mu_0\,(E_n + N\epsilon_n) \quad \text{with } \mu_0 := \gamma(1+\gamma)/(1-\rho), \tag{6.7}$$

guarantees that all further $E_n$ will be bounded by $R^*$ as well, provided $N\epsilon_n \le (\mu_0^{-1} - 1)R^*$. It
is only in guaranteeing that (6.1) must be satisfied for some $n_0$ that the update rule plays a role:
indeed, by Theorem 5.3, $E_n \to 0$ for $n \to \infty$ if $\epsilon_n$ is updated following (1.7), so that (6.1) has to
be satisfied eventually.

Other update rules may work as well. If $(\epsilon_n)_{n\in\mathbb{N}}$ is defined so that it is a monotonically decreasing
sequence with limit $\epsilon$, then the relation (6.7) immediately implies that

$$\limsup_{n\to\infty} E_n \le \frac{\mu_0 N\epsilon}{1-\mu_0}.$$

In particular, if $\epsilon = 0$, then $E_n \to 0$. The rate at which $E_n \to 0$ in this case will depend on $\mu_0$ as
well as on the rate with which $\epsilon_n \to 0$. We shall not quantify this relation, except to note that if
$\epsilon_n = O(\beta^n)$ for some $\beta < 1$, then $E_n = O(n\tilde\mu^n)$ where $\tilde\mu = \max(\mu_0, \beta)$. □
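The behaviour described in Remark 6.3 can be illustrated by iterating the worst case of the recursion (6.7), i.e. with equality instead of inequality. The sketch below is only an illustration under that assumption; the function name `simulate_error_bound` and the chosen parameter values are hypothetical.

```python
def simulate_error_bound(mu0, n_eps, e0, steps):
    """Iterate E_{n+1} = mu0 * (E_n + N*eps_n), the extreme case of (6.7).

    mu0   : contraction constant gamma(1+gamma)/(1-rho), assumed < 1
    n_eps : function n -> N * eps_n (any non-increasing sequence)
    e0    : starting error bound E_0
    steps : number of iterations
    Returns the list [E_0, E_1, ..., E_steps].
    """
    errors = [e0]
    for n in range(steps):
        errors.append(mu0 * (errors[-1] + n_eps(n)))
    return errors

# Constant eps: the bound settles at mu0 * N*eps / (1 - mu0).
constant_case = simulate_error_bound(0.5, lambda n: 0.2, 1.0, 60)

# Geometrically decaying eps_n = beta^n: the bound tends to zero.
decaying_case = simulate_error_bound(0.5, lambda n: 0.9 ** n, 1.0, 200)
```

In the first run the iterates approach the fixed point $\mu_0 N\epsilon/(1-\mu_0)$, matching the limsup bound; in the second they decay at a rate governed by the larger of $\mu_0$ and $\beta$.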

6.2 Case of noisy k-sparse vectors

We show here that the exponential rate of convergence to a $k$-sparse limit vector can be extended to
the case where the "ideal" (i.e. $k$-sparse) target vector has been corrupted by noise and is therefore
only "approximately $k$-sparse". More precisely, we no longer assume that $\mathcal{F}(y)$ contains a $k$-sparse
vector; consequently the limit $\bar x$ of the $x^n$ need not be an $\ell_1$-minimizer (see Theorem 5.3). If $x$ is
any $\ell_1$-minimizer in $\mathcal{F}(y)$, Theorem 5.3 guarantees $\|\bar x - x\|_{\ell_1} \le C\sigma_k(x)_{\ell_1}$; since this is the best level
of accuracy guaranteed in the limit, we are in this case interested only in how fast $x^n$ will converge
to a ball centered at $x$ with radius given by some (prearranged) multiple of $\sigma_k(x)_{\ell_1}$. (Note that if
$\mathcal{F}(y)$ contains several $\ell_1$-minimizers, they all lie within a distance $C'\sigma_k(x)_{\ell_1}$ of each other, so that
it does not matter which $x$ we pick.) We shall express the notion that $z$ is "approximately $k$-sparse
with gap ratio $C$", or a "noisy version of a $k$-sparse vector, with gap ratio $C$", by the condition

$$r(z)_k \ge C\sigma_k(z)_{\ell_1},$$

where $k$ is such that $\Phi$ has the NSP for some pair $K$, $\gamma$ such that $0 \le k < K - \frac{2\gamma}{1-\gamma}$ (e.g. we could
have $K = k + 2$ if $\gamma < 1/2$). If the gap ratio $C$ is much greater than the constant $C_1$ in (5.9), then
exponential convergence can be exhibited for a meaningful number of iterations. Note that this
class includes perturbations of any $k$-sparse vector for which the perturbation is sufficiently small
in $\ell_\infty$-norm (when compared to the unperturbed $k$-sparse vector).

Our argument for the noisy case will closely resemble the argument for exact $k$-sparse vectors.
However, there are some crucial differences that justify our decision to separate these two cases.

We will be interested only in the case $\epsilon > 0$, where we recall that $\epsilon$ is the limit of the $\epsilon_n$ occurring
in the algorithm. This assumption implies $\sigma_k(\bar x)_{\ell_1} > 0$, and can only happen if $\bar x$ is not $K$-sparse.
(As noted earlier, the exact $k$-sparse case always corresponds to $\epsilon = 0$ if $k < K - \frac{2\gamma}{1-\gamma}$. For $k$ in the
region $K - \frac{2\gamma}{1-\gamma} \le k \le K$, both $\epsilon = 0$ and $\epsilon > 0$ are theoretical possibilities.)

First, we redefine $\eta^n = x^n - x^\epsilon$, where $x^\epsilon$ is the minimizer of $f_\epsilon$ on $\mathcal{F}(y)$ and $\epsilon > 0$. We know
from Theorem 5.3 that $\eta^n \to 0$. We again set $E_n = \|\eta^n\|_{\ell_1}$.

Theorem 6.4 Given $0 < \rho < 1$, and integers $k$, $K$ with $k < K$, assume that $\Phi$ satisfies the NSP of
order $K$ with constant $\gamma$ such that all the conditions of Theorem 5.3 are satisfied and, in addition,

$$\mu := \frac{\gamma(1+\gamma)}{1-\rho}\left(1 + \frac{1}{K+1-k}\right) < 1.$$

Suppose $z \in \mathcal{F}(y)$ is "approximately $k$-sparse with gap ratio $C$", i.e.

$$r(z)_k \ge C\sigma_k(z)_{\ell_1}, \tag{6.8}$$

with $C > C_1$, where $C_1$ is as in Theorem 5.3. Let $T$ stand for the set of indices of the $k$ largest
entries of $x^\epsilon$, and let $n_0$ be such that

$$E_{n_0} \le R^* := \rho \min_{i\in T} |x_i^\epsilon| = \rho\, r(x^\epsilon)_k. \tag{6.9}$$

Then for all $n \ge n_0$, we have

$$E_{n+1} \le \mu E_n + B\sigma_k(z)_{\ell_1}, \tag{6.10}$$

where $B > 0$ is a constant. Similarly, if we define $\tilde E_n = \|x^n - z\|_{\ell_1}$, then

$$\tilde E_{n+1} \le \mu \tilde E_n + \tilde B\sigma_k(z)_{\ell_1}, \tag{6.11}$$

for $n \ge n_0$, where $\tilde B > 0$ is a constant. This implies that $x^n$ converges at an exponential (linear)
rate to the ball of radius $\tilde B(1-\mu)^{-1}\sigma_k(z)_{\ell_1}$ centered at $z$.

Remark 6.5 Note that Theorem 5.3 trivially implies the inequalities (6.10) and (6.11) in the
limit $n \to \infty$ since $E_n \to 0$, $\sigma_k(z)_{\ell_1} > 0$, and $\|\bar x - z\|_{\ell_1} \le C_1\sigma_k(z)_{\ell_1}$. However, Theorem 6.4
quantifies the regime in which it is guaranteed that the two measures of error, $E_n$ and $\tilde E_n$, must shrink
(at least) by a factor $\tilde\mu < 1$ at each iteration. This corresponds to the range
$\sigma_k(z)_{\ell_1} \lesssim E_n, \tilde E_n \lesssim r(x^\epsilon)_k$, and would be realized if, say, $z$ is the sum of a $k$-sparse vector and a
fully supported "noise" vector which is sufficiently small in $\ell_\infty$ norm. In this sense, the theorem
shows that the rate estimate of Theorem 6.1 extends to a neighborhood of $k$-sparse vectors.

Proof: First, note that the existence of $n_0$ is guaranteed by the fact that $E_n \to 0$ and $R^* > 0$.
For the latter, note that Lemma 4.1 and Theorem 5.3 imply

$$r(x^\epsilon)_k \ge r(z)_k - \|z - x^\epsilon\|_{\ell_1} \ge (C - C_1)\sigma_k(z)_{\ell_1},$$

so that $R^* \ge \rho(C - C_1)\sigma_k(z)_{\ell_1} > 0$.

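We follow the proof of Theorem 6.1 and consider the orthogonality relation (6.2). Since $x^\epsilon$ is
not sparse in general, we rewrite (6.2) as

$$\sum_{i=1}^N |\eta_i^{n+1}|^2 w_i^n = -\sum_{i=1}^N x_i^\epsilon \eta_i^{n+1} w_i^n = -\sum_{i\in T\cup T^c} \frac{x_i^\epsilon}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}}\,\eta_i^{n+1}. \tag{6.12}$$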

We deal with the contribution on $T$ in the same way as before:

$$\left|\sum_{i\in T} \frac{x_i^\epsilon}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}}\,\eta_i^{n+1}\right| \le \frac{1}{1-\rho}\|\eta_T^{n+1}\|_{\ell_1} \le \frac{\gamma}{1-\rho}\|\eta_{T^c}^{n+1}\|_{\ell_1}.$$



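For the contribution on $T^c$, note that

$$\beta_n := \max_{i\in T^c} \frac{|\eta_i^{n+1}|}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}}.$$

Since $\eta^n \to 0$ we have $\beta_n \to 0$. It follows that

$$\left|\sum_{i\in T^c} \frac{x_i^\epsilon}{[(x_i^n)^2 + \epsilon_n^2]^{1/2}}\,\eta_i^{n+1}\right| \le \beta_n\sigma_k(x^\epsilon)_{\ell_1} \le \beta_n\left(\sigma_k(z)_{\ell_1} + \|x^\epsilon - z\|_{\ell_1}\right) \le C_2\beta_n\sigma_k(z)_{\ell_1}, \tag{6.13}$$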



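where the second inequality is due to Lemma 4.1, the last one to Theorem 5.3, and $C_2 = C_1 + 1$.
Combining these two bounds, we get

$$\sum_{i=1}^N |\eta_i^{n+1}|^2 w_i^n \le \frac{\gamma}{1-\rho}\|\eta_{T^c}^{n+1}\|_{\ell_1} + C_2\beta_n\sigma_k(z)_{\ell_1}.$$

We combine this again with a Cauchy-Schwarz estimate, to obtain

$$\begin{aligned}
\|\eta_{T^c}^{n+1}\|_{\ell_1}^2 &\le \left(\sum_{i\in T^c} |\eta_i^{n+1}|^2 w_i^n\right)\left(\sum_{i\in T^c} [(x_i^n)^2 + \epsilon_n^2]^{1/2}\right) \\
&\le \left(\sum_{i=1}^N |\eta_i^{n+1}|^2 w_i^n\right)\left(\sum_{i\in T^c} \left[|\eta_i^n| + |x_i^\epsilon| + \epsilon_n\right]\right) \\
&\le \left(\frac{\gamma}{1-\rho}\|\eta_{T^c}^{n+1}\|_{\ell_1} + C_2\beta_n\sigma_k(z)_{\ell_1}\right)\left(\|\eta^n\|_{\ell_1} + \sigma_k(x^\epsilon)_{\ell_1} + N\epsilon_n\right) \\
&\le \left(\frac{\gamma}{1-\rho}\|\eta_{T^c}^{n+1}\|_{\ell_1} + C_2\beta_n\sigma_k(z)_{\ell_1}\right)\left(\|\eta^n\|_{\ell_1} + C_2\sigma_k(z)_{\ell_1} + N\epsilon_n\right).
\end{aligned} \tag{6.14}$$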



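It is easy to check that if $u^2 \le Au + B$, where $A$ and $B$ are positive, then $u \le A + B/A$. Applying
this to $u = \|\eta_{T^c}^{n+1}\|_{\ell_1}$ in the above estimate, we get

$$\|\eta_{T^c}^{n+1}\|_{\ell_1} \le \frac{\gamma}{1-\rho}\left[\|\eta^n\|_{\ell_1} + C_2\sigma_k(z)_{\ell_1} + N\epsilon_n\right] + C_3\beta_n\sigma_k(z)_{\ell_1}, \tag{6.15}$$

where $C_3 = C_2(1-\rho)/\gamma$. Similar to (6.6), we also have, by combining (4.4) with (part of) the chain
of inequalities (6.13),

$$N\epsilon_n \le r(x^n)_{K+1} \le \frac{1}{K+1-k}\left(\|x^n - x^\epsilon\|_{\ell_1} + \sigma_k(x^\epsilon)_{\ell_1}\right) \le \frac{1}{K+1-k}\left(\|\eta^n\|_{\ell_1} + C_2\sigma_k(z)_{\ell_1}\right), \tag{6.16}$$

and consequently (6.15) becomes

$$\|\eta^{n+1}\|_{\ell_1} \le (1+\gamma)\|\eta_{T^c}^{n+1}\|_{\ell_1} \le \frac{\gamma(1+\gamma)}{1-\rho}\left(1 + \frac{1}{K+1-k}\right)\|\eta^n\|_{\ell_1} + (1+\gamma)(C_3\beta_n + C_4)\,\sigma_k(z)_{\ell_1}, \tag{6.17}$$

where $C_4 = C_2\gamma(1-\rho)^{-1}(1 + 1/(K+1-k))$. Since the $\beta_n$ are bounded, this gives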

$$E_{n+1} \le \mu E_n + B\sigma_k(z)_{\ell_1}.$$

It then follows that if we pick $\tilde\mu$ so that $1 > \tilde\mu > \mu$, and consider the range of $n > n_0$ such that
$E_n \ge (\tilde\mu - \mu)^{-1}B\sigma_k(z)_{\ell_1} =: r^*$, then

$$E_{n+1} \le \tilde\mu E_n.$$

Hence we are guaranteed exponential decay of $E_n$ as long as $x^n$ is sufficiently far from its limit.
The smallest possible value of $r^*$ corresponds to the case $\tilde\mu \approx 1$.
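For completeness, the elementary fact used above to pass from the quadratic estimate (6.14) to (6.15), namely that $u^2 \le Au + B$ with $A, B > 0$ implies $u \le A + B/A$, can be verified as follows. This short derivation is a sketch added for the reader and is not taken from the original text.

```latex
u^2 \le Au + B
\;\Longrightarrow\;
u \le \frac{A + \sqrt{A^2 + 4B}}{2},
\qquad\text{and}\qquad
\Bigl(A + \frac{2B}{A}\Bigr)^{2} = A^2 + 4B + \frac{4B^2}{A^2} \ge A^2 + 4B,
\]
\[
\text{so that}\quad
u \le \frac{A + \bigl(A + \tfrac{2B}{A}\bigr)}{2} = A + \frac{B}{A}.
```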

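To establish a rate of convergence to a comparably-sized ball centered at $z$, we consider
$\tilde E_n = \|x^n - z\|_{\ell_1}$. It then follows that

$$\begin{aligned}
\tilde E_{n+1} &\le \|x^{n+1} - x^\epsilon\|_{\ell_1} + \|x^\epsilon - z\|_{\ell_1} \\
&\le \mu\|x^n - x^\epsilon\|_{\ell_1} + B\sigma_k(z)_{\ell_1} + C_1\sigma_k(z)_{\ell_1} \\
&\le \mu\|x^n - z\|_{\ell_1} + B\sigma_k(z)_{\ell_1} + C_1(1+\mu)\sigma_k(z)_{\ell_1} \\
&= \mu\tilde E_n + \tilde B\sigma_k(z)_{\ell_1},
\end{aligned} \tag{6.18}$$

which shows the claimed exponential decay and also that

$$\limsup_{n\to\infty} \tilde E_n \le \tilde B(1-\mu)^{-1}\sigma_k(z)_{\ell_1}.$$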



7 Beyond the convex case: $\ell_\tau$-minimization for $\tau < 1$

If $\Phi$ has the NSP of order $K$ with $\gamma < 1$, then (see above) $\ell_1$-minimization recovers $K$-sparse solutions
to $\Phi x = y$ for any $y \in \mathbb{R}^m$ that admits such a sparse solution, i.e., $\ell_1$-minimization gives also
$\ell_0$-minimizers, provided their support has size at most $K$. In [29], Gribonval and Nielsen showed
that in this case, $\ell_1$-minimization also gives the $\ell_\tau$-minimizers, i.e., $\ell_1$-minimization also solves
non-convex optimization problems of the type

$$x^* = \operatorname*{argmin}_{z\in\mathcal{F}(y)} \|z\|_{\ell_\tau^N}^\tau, \quad \text{for } 0 < \tau < 1. \tag{7.1}$$



Let us first recall the results of [29] that are of most interest to us here, reformulated for our 
setting and notations. 

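Lemma 7.1 ([29, Theorem 2]). Assume that $x^*$ is a $K$-sparse vector in $\mathcal{F}(y)$ and that $0 < \tau \le 1$. If

$$\sum_{i\in T} |\eta_i|^\tau < \sum_{i\in T^c} |\eta_i|^\tau, \quad \text{or, equivalently,} \quad \sum_{i\in T} |\eta_i|^\tau < \frac{1}{2}\sum_{i=1}^N |\eta_i|^\tau,$$

for all $\eta \in \mathcal{N}$ and for all $T \subset \{1, \ldots, N\}$ with $\#T \le K$, then

$$x^* = \operatorname*{argmin}_{z\in\mathcal{F}(y)} \|z\|_{\ell_\tau^N}^\tau.$$

Lemma 7.2 ([29, Theorem 5]). Let $z \in \mathbb{R}^N$, $0 < \tau_1 \le \tau_2 \le 1$, and $K \in \mathbb{N}$. Then

$$\sup_{T\subset\{1,\ldots,N\},\,\#T\le K} \frac{\sum_{i\in T} |z_i|^{\tau_1}}{\sum_{i=1}^N |z_i|^{\tau_1}} \le \sup_{T\subset\{1,\ldots,N\},\,\#T\le K} \frac{\sum_{i\in T} |z_i|^{\tau_2}}{\sum_{i=1}^N |z_i|^{\tau_2}}.$$

Combining these two lemmas with the earlier observations leads immediately to the following
result.

Theorem 7.3 Fix any $0 < \tau \le 1$. If $\Phi$ satisfies the NSP of order $K$ with constant $\gamma$ then

$$\sum_{i\in T} |\eta_i|^\tau \le \gamma \sum_{i\in T^c} |\eta_i|^\tau, \tag{7.2}$$

for all $\eta \in \mathcal{N}$ and for all $T \subset \{1, \ldots, N\}$ such that $\#T \le K$.

In addition, if $\gamma < 1$, and if there exists a $K$-sparse vector in $\mathcal{F}(y)$, then this $K$-sparse vector
is the unique minimizer in $\mathcal{F}(y)$ of $\|\cdot\|_{\ell_\tau}$.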

At first sight, these results suggest there is nothing to be gained by carrying out $\ell_\tau$- rather than
$\ell_1$-minimization; in addition, sparse recovery via the non-convex problems (7.1) is much harder than
the more easily solvable convex relaxation problem of $\ell_1$-minimization.

Yet, we shall show in this section that $\ell_\tau$-minimization has unexpected benefits, and that it
may be both useful and practically feasible via an IRLS approach. Before we start, it is expedient
to introduce the following definition: we shall say that $\Phi$ has the $\tau$-Null Space Property ($\tau$-NSP)
of order $K$ with constant $\gamma > 0$ if, for all sets $T$ of cardinality at most $K$ and all $\eta \in \mathcal{N}$,

$$\|\eta_T\|_{\ell_\tau^N}^\tau \le \gamma\|\eta_{T^c}\|_{\ell_\tau^N}^\tau. \tag{7.3}$$

In what follows we shall construct an IRLS algorithm for $\ell_\tau$-minimization. We shall see that

(a) In practice, $\ell_\tau$-minimization can be carried out by an IRLS algorithm. Hence, the non-convexity
does not necessarily make the problem intractable;

(b) In particular, if $\Phi$ satisfies the $\tau$-NSP of order $K$, and if there exists a $k$-sparse vector $x^*$ in
$\mathcal{F}(y)$, with $k < K - \kappa$ for suitable $\kappa$ given below, then the IRLS algorithm converges to the
$\ell_\tau$-minimizer $x^\tau$, which, therefore, will coincide with $x^*$;

(c) Surprisingly, the rate of local convergence of the algorithm is superlinear; the rate is larger
for smaller $\tau$, increasing to approach a quadratic regime as $\tau \to 0$. More precisely, we will
show that the local error $E_n := \|x^n - x^*\|_{\ell_\tau^N}^\tau$ satisfies

$$E_{n+1} \le \mu(\gamma, \tau)\,E_n^{2-\tau}, \tag{7.4}$$

where $\mu(\gamma, \tau) < 1$ for $\gamma > 0$ sufficiently small. The validity of (7.4) is restricted to $x^n$ in a
(small) ball centered at $x^*$. In particular, if $x^n$ is close enough to $x^*$ then (7.4) ensures the
convergence of the algorithm to the $k$-sparse solution $x^*$.

Some of these virtues of $\ell_\tau$-minimization were recently highlighted by Chartrand and his
collaborators [11, 12, 13]. Chartrand and Staneva [13] give a fine analysis of the RIP from which they
can conclude that $\ell_\tau$-minimization not only recovers $k$-sparse vectors, but that the range of $k$ for
which this recovery works is larger for smaller $\tau$. Namely, for random Gaussian matrices, they prove
that with high probability on the draw of the matrix, sparse recovery by $\ell_\tau$-minimization works for
$k \le m[c_1(\tau) + \tau c_2(\tau)\log(N/k)]^{-1}$, where $c_1(\tau)$ is bounded and $c_2(\tau)$ decreases to zero as $\tau \to 0$.
In particular, the dependence of the sparsity $k$ on the number $N$ of columns vanishes for $\tau \to 0$.
These bounds give a quantitative estimate of the improvement provided by $\ell_\tau$-minimization vis-à-vis
$\ell_1$-minimization, for which the range of $k$-sparsity for having exact recovery is clearly smaller
(see Figure 8.4 for a numerical illustration).

7.1 Some useful properties of $\ell_\tau$ spaces

We start by listing in one proposition some fundamental and well-known properties of $\ell_\tau$ spaces for
$0 < \tau \le 1$. For further details we refer the reader to, e.g., [19].

Proposition 7.4

(i) Assume $0 < \tau \le 1$. Then the map $z \mapsto \|z\|_{\ell_\tau^N}$ defines a quasi-norm for $\mathbb{R}^N$; in particular the
triangle inequality holds up to a constant, i.e.,

$$\|u + v\|_{\ell_\tau^N} \le C(\tau)\left(\|u\|_{\ell_\tau^N} + \|v\|_{\ell_\tau^N}\right), \quad \text{for all } u, v \in \mathbb{R}^N. \tag{7.5}$$

If one considers the $\tau$-th powers of the "$\tau$-norm", then one has the so-called "$\tau$-triangle inequality":

$$\|u + v\|_{\ell_\tau^N}^\tau \le \|u\|_{\ell_\tau^N}^\tau + \|v\|_{\ell_\tau^N}^\tau, \quad \text{for all } u, v \in \mathbb{R}^N. \tag{7.6}$$

(ii) We have, for any $0 < \tau_1 \le \tau_2 \le \infty$,

$$\|u\|_{\ell_{\tau_2}} \le \|u\|_{\ell_{\tau_1}}, \quad \text{for all } u \in \mathbb{R}^N. \tag{7.7}$$

We will refer to this norm estimate by writing the embedding relation $\ell_{\tau_1} \hookrightarrow \ell_{\tau_2}$.

(iii) (Generalized Hölder inequality) For $0 < \tau \le 1$ and $0 < p, q < \infty$ such that $\frac{1}{\tau} = \frac{1}{p} + \frac{1}{q}$, and for
a positive weight vector $w = (w_i)_{i=1}^N$, we have

$$\|(u_i v_i)_{i=1}^N\|_{\ell_\tau^N(w)} \le \|u\|_{\ell_p^N(w)}\|v\|_{\ell_q^N(w)}, \quad \text{for all } u, v \in \mathbb{R}^N, \tag{7.8}$$

where $\|v\|_{\ell_r^N(w)} := \left(\sum_{i=1}^N |v_i|^r w_i\right)^{1/r}$, as usual, for $0 < r < \infty$.

For technical reasons, it is often more convenient to employ the $\tau$-triangle inequality (7.6) than
(7.5); in this sense, for $\ell_\tau$-minimization $\|\cdot\|_{\ell_\tau^N}^\tau$ turns out to be more natural as a measure of error
than the quasi-norm $\|\cdot\|_{\ell_\tau^N}$.

In order to prove the three claims (a)-(c) listed before the start of this subsection, we also
need to generalize to $\ell_\tau$ certain results previously shown only for $\ell_1$. In the following we assume
$0 < \tau \le 1$. We denote by

$$\sigma_k(z)_{\ell_\tau^N} := \sum_{\nu > k} r(z)_\nu^\tau$$

the error of the best $k$-term approximation to $z$ with respect to $\|\cdot\|_{\ell_\tau^N}^\tau$. As a straightforward
generalization of analogous results valid for the $\ell_1$-norm, we have the following two technical lemmas.

Lemma 7.5 For any $j \in \{1, \ldots, N\}$, we have

$$|\sigma_j(z)_{\ell_\tau^N} - \sigma_j(z')_{\ell_\tau^N}| \le \|z - z'\|_{\ell_\tau^N}^\tau,$$

for all $z, z' \in \mathbb{R}^N$. Moreover, for any $J > j$, we have

$$(J - j)\,r(z)_J^\tau \le \sigma_j(z)_{\ell_\tau^N} \le \|z - z'\|_{\ell_\tau^N}^\tau + \sigma_j(z')_{\ell_\tau^N}.$$

Lemma 7.6 Assume that $\Phi$ has the $\tau$-NSP of order $K$ with constant $0 < \gamma < 1$. Then, for any
$z, z' \in \mathcal{F}(y)$, we have

$$\|z' - z\|_{\ell_\tau^N}^\tau \le \frac{1+\gamma}{1-\gamma}\left(\|z'\|_{\ell_\tau^N}^\tau - \|z\|_{\ell_\tau^N}^\tau + 2\sigma_K(z)_{\ell_\tau^N}\right).$$

The proofs of these lemmas are essentially identical to the ones of Lemma 4.1 and Lemma 4.2,
except for substituting $\|\cdot\|_{\ell_\tau^N}^\tau$ for $\|\cdot\|_{\ell_1^N}$ and $\sigma_k(\cdot)_{\ell_\tau^N}$ for $\sigma_k(\cdot)_{\ell_1^N}$, respectively.

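7.2 An IRLS algorithm for $\ell_\tau$-minimization

To define an IRLS algorithm promoting $\ell_\tau$-minimization for a generic $0 < \tau \le 1$, we first define a
$\tau$-dependent functional $J_\tau$, generalizing $J$:

$$J_\tau(z, w, \epsilon) := \frac{\tau}{2}\left[\sum_{j=1}^N z_j^2 w_j + \sum_{j=1}^N\left(\epsilon^2 w_j + \frac{2-\tau}{\tau}\,\frac{1}{w_j^{\frac{\tau}{2-\tau}}}\right)\right], \quad z \in \mathbb{R}^N,\ w \in \mathbb{R}_+^N,\ \epsilon \in \mathbb{R}_+. \tag{7.9}$$

The desired algorithm is then defined simply by substituting $J_\tau$ for $J$ in Algorithm 1, keeping the
same update rule (1.7) for $\epsilon$. In particular we have

$$w_j^{n+1} = \left((x_j^{n+1})^2 + \epsilon_{n+1}^2\right)^{-\frac{2-\tau}{2}}, \quad j = 1, \ldots, N,$$

and

$$J_\tau(x^{n+1}, w^{n+1}, \epsilon_{n+1}) = \sum_{j=1}^N \left((x_j^{n+1})^2 + \epsilon_{n+1}^2\right)^{\frac{\tau}{2}}.$$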
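The closed form of the weight update can be checked by minimizing the functional over each $w_j$ separately. The following short derivation is a sketch under the assumption that $J_\tau(z,w,\epsilon) = \frac{\tau}{2}\sum_j \bigl[(z_j^2+\epsilon^2)w_j + \frac{2-\tau}{\tau} w_j^{-\tau/(2-\tau)}\bigr]$, with $z$ and $\epsilon$ held fixed:

```latex
\frac{\partial J_\tau}{\partial w_j}(z,w,\epsilon)
  = \frac{\tau}{2}\Bigl[z_j^2+\epsilon^2 - w_j^{-\frac{2}{2-\tau}}\Bigr] = 0
  \;\Longrightarrow\;
  w_j = \bigl(z_j^2+\epsilon^2\bigr)^{-\frac{2-\tau}{2}},
\]
\[
J_\tau(z,w,\epsilon)\Big|_{w_j=(z_j^2+\epsilon^2)^{-\frac{2-\tau}{2}}}
  = \frac{\tau}{2}\sum_{j=1}^N\Bigl[1+\frac{2-\tau}{\tau}\Bigr]\bigl(z_j^2+\epsilon^2\bigr)^{\frac{\tau}{2}}
  = \sum_{j=1}^N \bigl(z_j^2+\epsilon^2\bigr)^{\frac{\tau}{2}}.
```

For $\tau = 1$ this reduces to the weight $w_j = (z_j^2+\epsilon^2)^{-1/2}$ used in the $\ell_1$ case.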
Fundamental properties of the algorithm are derived in the same way as before. In particular,
the values $J_\tau(x^n, w^n, \epsilon_n)$ decrease monotonically,

$$J_\tau(x^{n+1}, w^{n+1}, \epsilon_{n+1}) \le J_\tau(x^n, w^n, \epsilon_n), \quad n \ge 0,$$

and the iterates are bounded,

$$\|x^n\|_{\ell_\tau^N}^\tau \le J_\tau(x^1, w^0, \epsilon_0) =: A_0.$$

As in Lemma 4.4, the weights are uniformly bounded from below, i.e.,

$$w_j^n \ge A_0^{-\frac{2-\tau}{\tau}}, \quad j = 1, \ldots, N.$$

Moreover, using $J_\tau$ for $J$ in Lemma 5.1, we can again prove the asymptotic regularity of the
iterations, i.e.,

lim ||x'"+^ - x^lLiv = 0. 

The first significant difference with the $\ell_1$-case arises when $\epsilon = \lim_{n\to\infty} \epsilon_n > 0$. In this latter
situation, we need to consider the function

$$f_{\epsilon,\tau}(z) := \sum_{j=1}^N (z_j^2 + \epsilon^2)^{\frac{\tau}{2}}. \tag{7.10}$$

We denote by $Z_{\epsilon,\tau}(y)$ its set of minimizers on $\mathcal{F}(y)$ (since $f_{\epsilon,\tau}$ is no longer convex it may have more
than one minimizer). Even though every minimizer $z \in Z_{\epsilon,\tau}(y)$ still satisfies

$$\langle z, \eta\rangle_{\tilde w} = 0, \quad \text{for all } \eta \in \mathcal{N},$$

where $\tilde w = \tilde w^{\epsilon,\tau,z}$ is defined by $\tilde w_j^{\epsilon,\tau,z} = ((z_j)^2 + \epsilon^2)^{\frac{\tau-2}{2}}$, $j = 1, \ldots, N$, the converse need no longer
be true.

The following theorem summarizes the convergence properties of the algorithm in the case
$\tau < 1$.

Theorem 7.7 Fix $y \in \mathbb{R}^m$. Let $K$ (the same index as in the update rule (1.7)) be chosen so that
$\Phi$ satisfies the $\tau$-NSP of order $K$ with a constant $\gamma$ such that $\gamma < 1 - \frac{2}{K+2}$. Let $\bar Z_{\epsilon,\tau}(y)$ be the set of
accumulation points of $(x^n)_{n\in\mathbb{N}}$, and define $\epsilon := \lim_{n\to\infty} \epsilon_n$. Then, the algorithm has the following
properties:

(i) If $\epsilon = 0$, then $\bar Z_{\epsilon,\tau}(y)$ consists of a single point $\bar x$, the $x^n$ converge to $\bar x$, and $\bar x$ is an $\ell_\tau$-minimizer
in $\mathcal{F}(y)$ which is also $K$-sparse.

(ii) If $\epsilon > 0$, then for each $\bar x \in \bar Z_{\epsilon,\tau}(y)$ we have $\langle \bar x, \eta\rangle_{\tilde w^{\epsilon,\tau,\bar x}} = 0$ for all $\eta \in \mathcal{N}$.

(iii) If $z \in \mathcal{F}(y)$ and $\bar x \in \bar Z_{\epsilon,\tau}(y) \cap Z_{\epsilon,\tau}(y)$, we have

$$\|z - \bar x\|_{\ell_\tau^N}^\tau \le C_2\sigma_k(z)_{\ell_\tau^N},$$

for all $k < K - \frac{2\gamma}{1-\gamma}$.

The proof of this theorem uses Lemmas 7.1-7.6 and follows the same arguments as for Theorem 5.3.

Remark 7.8 Unlike Theorem 5.3, Theorem 7.7 does not ensure that the IRLS algorithm converges
to the sparsest or to the minimal $\ell_\tau$-solution. It does provide conditions that are verifiable a
posteriori (e.g., $\epsilon = \lim_{n\to\infty} \epsilon_n = 0$) for such convergence. The reason for this weaker result is the
non-convexity of $f_{\epsilon,\tau}$. (In particular, it might happen that an accumulation point is a local minimizer
of $f_{\epsilon,\tau}$, but not a global one, and then the estimate in (iii) does not necessarily hold.) Nevertheless, as is often the case
for non-convex problems, we can establish a local convergence result that also highlights the rate
we can expect for such convergence. This is the content of the following section; it will be followed
by numerical results that dovetail nicely with the theoretical results.

7.3 Local superlinear convergence

Throughout this section, we assume that there exists a $k$-sparse vector $x^*$ in $\mathcal{F}(y)$. We define the
error vectors $\eta^n = x^n - x^* \in \mathcal{N}$; we now measure the error by $\|\cdot\|_{\ell_\tau^N}^\tau$:

$$E_n := \|\eta^n\|_{\ell_\tau^N}^\tau.$$

Theorem 7.9 Assume that $\Phi$ has the $\tau$-NSP of order $K$ with constant $\gamma \in (0, 1)$ and that $\mathcal{F}(y)$
contains a $k$-sparse vector $x^*$ with $k \le K$. (Here $K$ is the same as in the definition of $\epsilon_n$ in the
update rule (1.7) in Algorithm 1.) Suppose that, for a given $0 < \rho < 1$, we have

$$E_{n_0} \le R^* := [\rho\, r(x^*)_k]^\tau \tag{7.11}$$

and define

$$\mu := \mu(\rho, K, \gamma, \tau, N) = 2^{1-\tau}\gamma(1+\gamma)A^\tau\left(1 + \left(\frac{N^{1-\tau}}{K+1-k}\right)^{2-\tau}\right), \quad A := \left(r(x^*)_k^{1-\tau}(1-\rho)^{2-\tau}\right)^{-1}.$$

If $\rho$ and $\gamma$ are sufficiently small so that

$$\mu(R^*)^{1-\tau} = \mu\,(\rho\, r(x^*)_k)^{\tau(1-\tau)} \le 1, \tag{7.12}$$

then for all $n \ge n_0$ we have

$$E_{n+1} \le \mu E_n^{2-\tau}. \tag{7.13}$$
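To get a feeling for the difference between the linear rate of Theorem 6.1 and the superlinear rate (7.13), one can iterate the two extreme recursions $E_{n+1} = \mu E_n$ and $E_{n+1} = \mu E_n^{2-\tau}$ and count the steps needed to reach a given tolerance. The sketch below is purely illustrative; the function name and the numerical values of $\mu$, $E_0$ and the tolerance are assumptions, not values from the paper.

```python
def iterations_to_tolerance(mu, exponent, e0, tol, max_iter=10_000):
    """Count the steps of E_{n+1} = mu * E_n**exponent needed to reach E_n <= tol.

    exponent = 1 models the linear rate of the l1 case;
    exponent = 2 - tau models the superlinear rate (7.13)."""
    e, n = e0, 0
    while e > tol and n < max_iter:
        e = mu * e ** exponent
        n += 1
    return n

linear_steps = iterations_to_tolerance(0.5, 1.0, 0.1, 1e-12)
superlinear_steps = iterations_to_tolerance(0.5, 2.0 - 0.5, 0.1, 1e-12)  # tau = 0.5
```

With the same constant, the superlinear recursion reaches the tolerance in far fewer steps than the linear one, and the gap widens as $\tau$ decreases towards 0 (exponent approaching 2).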

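Proof: The proof is by induction on $n$. We assume that $E_n \le R^*$ and derive (7.13). As in
the proof of Theorem 6.1, we let $T$ denote the support of $x^*$, so that $\#(T) = k$ and $r(x^*)_k$ is the
smallest nonzero entry of $x^*$ in absolute value. Following the proof of Theorem 6.1, the first few lines are the same. The first
difference is in the following estimate, which holds for $i \in T$ and replaces (6.3):

$$\frac{|x_i^*|}{((x_i^n)^2 + \epsilon_n^2)^{1-\tau/2}} \le \frac{|x_i^*|}{|x_i^* + \eta_i^n|^{2-\tau}} \le \frac{|x_i^*|}{(|x_i^*|(1-\rho))^{2-\tau}} = \frac{1}{|x_i^*|^{1-\tau}(1-\rho)^{2-\tau}} \le A.$$

Starting with the orthogonality relation (6.2) and using the above inequality and the embedding
$\ell_\tau \hookrightarrow \ell_1$, we obtain

$$\sum_{i=1}^N |\eta_i^{n+1}|^2 w_i^n \le A\left(\sum_{i\in T} |\eta_i^{n+1}|^\tau\right)^{1/\tau}.$$

We now apply the $\tau$-NSP to find

$$\|\eta^{n+1}\|_{\ell_2^N(w^n)}^{2\tau} = \left(\sum_{i=1}^N |\eta_i^{n+1}|^2 w_i^n\right)^{\tau} \le \gamma A^\tau\|\eta_{T^c}^{n+1}\|_{\ell_\tau^N}^\tau. \tag{7.14}$$

At the same time, the generalized Hölder inequality (see Proposition 7.4 (iii)) for $p = 2$ and $q = \frac{2\tau}{2-\tau}$,
together with the above estimates, yields

$$\begin{aligned}
\|\eta_{T^c}^{n+1}\|_{\ell_\tau^N}^{2\tau} &= \left\|\left(|\eta_i^{n+1}|(w_i^n)^{-1/\tau}\right)_{i\in T^c}\right\|_{\ell_\tau^N(w^n)}^{2\tau} \\
&\le \|\eta^{n+1}\|_{\ell_2^N(w^n)}^{2\tau}\left\|\left((w_i^n)^{-1/\tau}\right)_{i\in T^c}\right\|_{\ell_{2\tau/(2-\tau)}^N(w^n)}^{2\tau} \\
&\le \gamma A^\tau\|\eta_{T^c}^{n+1}\|_{\ell_\tau^N}^\tau\left\|\left((w_i^n)^{-1/\tau}\right)_{i\in T^c}\right\|_{\ell_{2\tau/(2-\tau)}^N(w^n)}^{2\tau}.
\end{aligned}$$

In other words,

$$\|\eta_{T^c}^{n+1}\|_{\ell_\tau^N}^\tau \le \gamma A^\tau\left\|\left((w_i^n)^{-1/\tau}\right)_{i\in T^c}\right\|_{\ell_{2\tau/(2-\tau)}^N(w^n)}^{2\tau}. \tag{7.15}$$

Let us now estimate the weight term. By the $\tau$-triangle inequality (7.6) we have

$$\left\|\left((w_i^n)^{-1/\tau}\right)_{i\in T^c}\right\|_{\ell_{2\tau/(2-\tau)}^N(w^n)}^{2\tau} = \left(\sum_{i\in T^c} \left(|\eta_i^n|^2 + \epsilon_n^2\right)^{\tau/2}\right)^{2-\tau} \le \left(\sum_{i\in T^c} \left(|\eta_i^n|^\tau + \epsilon_n^\tau\right)\right)^{2-\tau} \le 2^{1-\tau}\left(\|\eta^n\|_{\ell_\tau^N}^{\tau(2-\tau)} + N^{2-\tau}\epsilon_n^{\tau(2-\tau)}\right).$$

Now, an application of Lemma 7.5 gives the following estimates:

$$(N\epsilon_n)^\tau \le r(x^n)_{K+1}^\tau \le \frac{1}{K+1-k}\left(\|x^n - x^*\|_{\ell_\tau^N}^\tau + \sigma_k(x^*)_{\ell_\tau^N}\right) = \frac{\|\eta^n\|_{\ell_\tau^N}^\tau}{K+1-k}.$$

Using these estimates in (7.15) gives

$$\|\eta_{T^c}^{n+1}\|_{\ell_\tau^N}^\tau \le 2^{1-\tau}\gamma A^\tau\left(1 + \left(\frac{N^{1-\tau}}{K+1-k}\right)^{2-\tau}\right)\|\eta^n\|_{\ell_\tau^N}^{\tau(2-\tau)},$$

and (7.13) follows by a further application of the $\tau$-NSP (see (6.5)).

Because of the assumption (7.12), we also have $E_{n+1} \le R^*$ and so the induction can continue. ■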



Remark 7.10 In contrast to the $\ell_1$ case, we do not need $\mu < 1$ to ensure that $E_n$ decreases.
All that is needed for the error reduction is $\mu E_n^{1-\tau} < 1$ for some sufficiently large $n$. In fact, $\mu$ could
be quite large in cases where the smallest non-zero component of the sparse vector is very small.
We have not observed this effect in our examples; we expect that our analysis, although apparently
accurate in describing the rate of convergence (see Section 8), is too pessimistic in estimating the
coefficient $\mu$.

8 Numerical results 

In this section we present numerical experiments that illustrate that the bounds derived in the 
theoretical analysis do manifest themselves in practice. 

8.1 Convergence rates 

We start with numerical results that confirm the linear rate of convergence of our iteratively
re-weighted least squares algorithm for $\ell_1$-minimization, and its robust recovery of sparse vectors. In
the experiments we used a matrix $\Phi$ of dimensions $m \times N$ with i.i.d. Gaussian $\mathcal{N}(0, 1/m)$ entries.
Such matrices are known to possess (with high probability) the RIP property with optimal bounds
[21 m [35] . In Figure 8.1 we depict the approximation error to the unique sparsest solution shown
in Figure 8.2, and the instantaneous rate of convergence. The numerical results confirm both the
expected linear rate of convergence and the robust reconstruction of the sparse vector.

Next, we compare the linear convergence achieved with $\ell_1$-minimization with the superlinear
convergence obtained by the iteratively re-weighted least squares algorithm promoting
$\ell_\tau$-minimization.

In Figure 8.3 we are interested in the comparison of the rate of convergence when our algorithm
is used for different choices of $0 < \tau \le 1$. For $\tau = 1$, .8, .6 and .56, the figure shows the error,
as a function of the iteration step $n$, for the iterative algorithm, with different fixed values of
$\tau$. For $\tau = 1$, the rate is linear, as in Figure 8.1. For the smaller values $\tau = .8$, .6 and .56 the
iterations initially follow the same linear rate; once they are sufficiently close to the sparse solution,
the convergence rate speeds up dramatically, suggesting we have entered the region of validity of
(7.13). For smaller values of $\tau$ numerical experiments do not always lead to convergence: in some
cases the algorithm never got to the neighborhood of the solution where convergence is ensured.
Figure 8.1: An experiment, with a matrix $\Phi$ of size $250 \times 1500$ with Gaussian $\mathcal{N}(0, \frac{1}{250})$ i.i.d. entries,
in which recovery is sought of the 45-sparse vector $x^*$ represented in Figure 8.2 from its image
$y = \Phi x^*$. Left: plot of $\log_{10}(\|x^n - x^*\|_{\ell_1})$ as a function of $n$, where the $x^n$ are generated by Algorithm
1 with $\epsilon_n$ defined adaptively, as in (1.7). Note that the scale in the ordinate axis does not report
the logarithm $0, -1, -2, \ldots$, but the corresponding accuracies $10^0, 10^{-1}, 10^{-2}, \ldots$ for $\|x^n - x^*\|_{\ell_1}$.
The graph also plots $\epsilon_n$ as a function of $n$. Right: plot of the ratios $\|x^n - x^{n+1}\|_{\ell_1} / \|x^{n-1} - x^n\|_{\ell_1}$
and $(\epsilon_n - \epsilon_{n+1})/(\epsilon_{n-1} - \epsilon_n)$ for the same examples.
Figure 8.2: The sparse vector used in the example illustrated in Figure 8.1. This vector has
dimension 1500, but only 45 non-zero entries.



However, in this case a combination of initial iterations with the $\ell_1$-inspired IRLS (for which we
always have convergence) and later iterations with the $\ell_\tau$-inspired IRLS for smaller $\tau$ again allows for
very fast convergence to the sparsest solution; this is illustrated in Figure 8.3 for the case $\tau = 0.5$.
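The combined strategy just described can be sketched in plain Python as below. This is a hedged illustration, not the implementation behind the figures. It assumes that each step solves the weighted least squares problem in closed form, $x = D\Phi^T(\Phi D\Phi^T)^{-1}y$ with $D = \mathrm{diag}(1/w_j)$. It also assumes the adaptive rule takes the form $\epsilon_{n+1} = \min(\epsilon_n, r(x^{n+1})_{K+1}/N)$, where $r(x)$ is the non-increasing rearrangement of $|x|$, and that the weights are $w_j = (x_j^2 + \epsilon^2)^{-1+\tau/2}$. All function names, the `eps_stop` guard, and the exact point at which the exponent parameter (called `tau` in the code) is switched are assumptions of this sketch.

```python
def solve(A, b):
    """Solve the square system A z = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def weighted_least_squares(Phi, y, w):
    """Minimize sum_j w_j * x_j**2 subject to Phi x = y (all w_j > 0)."""
    m, N = len(Phi), len(Phi[0])
    d = [1.0 / wj for wj in w]
    G = [[sum(Phi[i][j] * d[j] * Phi[k][j] for j in range(N))
          for k in range(m)] for i in range(m)]
    z = solve(G, y)
    return [d[j] * sum(Phi[i][j] * z[i] for i in range(m)) for j in range(N)]


def hybrid_schedule(n):
    """tau = 1 for the first 10 weight updates, tau = 0.5 afterwards."""
    return 1.0 if n < 10 else 0.5


def irls(Phi, y, K, tau_schedule, n_iter, eps_stop=1e-8):
    """Run IRLS with unit initial weights; return the last iterate and eps values."""
    N = len(Phi[0])
    w, eps = [1.0] * N, 1.0
    x, eps_history = None, []
    for n in range(n_iter):
        x = weighted_least_squares(Phi, y, w)
        r = sorted((abs(v) for v in x), reverse=True)
        eps = min(eps, r[K] / N)  # r[K] is the (K+1)-th largest magnitude
        eps_history.append(eps)
        if eps < eps_stop:  # guard against ill-conditioned weights
            break
        tau = tau_schedule(n)
        w = [(v * v + eps * eps) ** (-1.0 + tau / 2.0) for v in x]
    return x, eps_history
```

A run with the hybrid schedule would be `irls(Phi, y, K, hybrid_schedule, n_iter)`, while `irls(Phi, y, K, lambda n: 1.0, n_iter)` keeps `tau = 1` throughout. The dense solver costs on the order of $m^2 N$ operations per step, so this version is only suitable for small test problems.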



8.2 Enhanced recovery in compressed sensing and relationship with other work 

Candès, Wakin, and Boyd [8] showed, by numerical experimentation, that iteratively re-weighted
$\ell_1$-minimization, with weights suggested by an $\ell_0$-minimization goal, can enhance the range of sparsity
for which perfect reconstruction of a sparse vector "works" in compressed sensing. In experiments
with iteratively re-weighted $\ell_2$-minimization algorithms, Chartrand and several collaborators
observed a similar significant improvement [TTl [T^l [131 HH ES] ; see in particular [T31 Section 4] ; we
also illustrate this in Figure 8.4. It is to be noted that IRLS algorithms are computationally much
less demanding than weighted $\ell_1$-minimization. In addition, there is, as far as we know, no analysis
(as yet) for re-weighted $\ell_1$-minimization that is comparable to the detailed theoretical analysis of
convergence presented here for our IRLS algorithm, which seems to give a realistic picture of the
numerical computations.
Figure 8.3: We show the decay of the logarithmic error, as a function of the number of iterations of the
algorithm, for different values of $\tau$ (1, 0.8, 0.6, 0.56). We also show the results of an experiment in
which the initial 10 iterations are performed with $\tau = 1$ and the remaining iterations with $\tau = 0.5$.